### Introduction
In this discussion section we study the learning problem for the canonical case of a one-dimensional function $f$ on the segment $[-1,1]$. It runs in parallel with problem 8 from HW1. We are going to look at two feature families and see which properties the data has under each of them.



In [51]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from ipywidgets import interactive
import ipywidgets as widgets
from ipywidgets import fixed


## Part a): Generating training data

In this part we generate the data from which we attempt to learn the function. The function of our choice for this section is 
$$
f(x) = \begin{cases}
8(x + 0.6)^2 - 0.72& x \in [-1, -0.3]\\
0& x\in[-0.3, 0.3]\\
\sin(8\pi (x - 0.3))& x \in [0.3, 1].
\end{cases}
$$

We are going to sample points $\{x_i\}_{i=1}^n$ from $[-1, 1]$ and learn $f$ from values $\{f(x_i) + \varepsilon_i\}$, where $\{\varepsilon_i\}_{i=1}^n$ is some additive noise.


### Generating $x_i$
In this section we consider two ways of sampling $x_i$ for training data. <br>
1. $x_i$ sampled at random from the uniform distribution on $[-1,1]$.
2. $x_i$ from an evenly spaced grid on the interval $[-1,1]$.
   For example, for $n = 4$ we have the samples $(-1, -0.5, 0, 0.5)$. Note that the endpoint $1$ is not included in our training set. 
   This kind of evenly spaced samples gives rise to interesting properties of the feature matrix when using Fourier features as we will see in part c.
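
As a quick check of the grid convention described above, the following minimal sketch reproduces the $n = 4$ example with `np.linspace` and `endpoint=False`. The variable names `grid` and `spacing` are only illustrative.

```python
import numpy as np

n = 4  # same toy value as in the example above
# endpoint=False drops the right endpoint 1, so the spacing is 2/n rather than 2/(n-1)
grid = np.linspace(-1, 1, n, endpoint=False)
spacing = grid[1] - grid[0]

print(grid)     # the four grid points, starting at -1
print(spacing)  # distance between neighbouring points
```

With this convention the grid covers exactly one period of every feature $\sin(k\pi x)$ and $\cos(k\pi x)$ without repeating a point.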


In [52]:
def generate_x(n, x_type, x_low=-1, x_high=1):  
    if x_type == 'grid':
        x = np.linspace(x_low, x_high, n, endpoint = False).astype(np.float64)

    elif x_type == 'uniform_random':
        x = np.sort(np.random.uniform(x_low, x_high, n).astype(np.float64))
        # Sort the random samples in ascending order to make plotting easier.
    else:
        raise ValueError(f"Unknown x_type: {x_type!r}")
    
    
    return x

### Generating $y_i$

        
Here we generate our observed values $\{f(x_i)\}$. Note that we are not adding noise yet.

In [53]:
def generate_y(x):
    y = np.zeros(len(x))
    y[x < -0.3] = 8 * (x[x < -0.3] + 0.6)**2 - 0.72
    y[x> 0.3] = np.sin(2 * np.pi * (x[x> 0.3]-0.3) * 4)
    return y

### Visualizing training data

The following cell plots $f$. The line shows the true function, and the dots indicate the data that we get from our measurements. 

In [54]:
def plot_training_data(x_type, n=64): 
    x_true = generate_x(x_type = 'grid', n=1000)
    x_train = generate_x(x_type=x_type, n=n)
    labels = ['y']
    
    y_true = generate_y(x=x_true)
    y_train = generate_y(x=x_train)
    plt.plot(x_true, y_true, linewidth = 0.5)
    plt.ylabel('y')
    plt.xlabel('x')
    plt.scatter(x_train, y_train, marker='o', label = 'f')
    plt.legend(bbox_to_anchor  = (1.03, 0.97))
    plt.show()
    
    

slider = widgets.RadioButtons(
    options=['uniform_random', 'grid'],
    description='x_type:',
    disabled=False
)
slider1 = widgets.IntSlider(
    value=65,
    min=65,
    max=1000,
    step=2,
    description='n:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)


interactive_plot = interactive(plot_training_data, x_type=slider, n=slider1)
output = interactive_plot.children[-1]
interactive_plot




### Noise in training data

Now we add noise. We model this by assuming that our samples of $y$ are corrupted by additive white Gaussian noise (AWGN). The true function is given by $y = f(x)$. The amount of noise is controlled by its standard deviation, which we denote by awgn_std. The noiseless case corresponds to awgn_std = 0.

In [55]:
def add_awgn_noise(y, awgn_std=0):
    noise = np.random.normal(0, awgn_std, y.shape)
    y_noisy = y + noise    
    return y_noisy

### Visualizing noise in training data

In [56]:
def plot_noisy_training_data(awgn_std, n=64): 
    np.random.seed(7)
    x_true = generate_x(x_type = 'grid', n=1000)
    x_train = generate_x(x_type='uniform_random', n=n)
    y_true = generate_y(x=x_true)
    y_train_clean = generate_y(x=x_train)
    y_train = add_awgn_noise(y_train_clean, awgn_std=awgn_std)
    plt.plot(x_true, y_true, linewidth = 0.5, label = 'True function')
    plt.ylabel('y')
    plt.xlabel('x')
    plt.ylim([-4,4])
    plt.scatter(x_train, y_train, marker='o', label = 'Training samples')
    plt.legend(loc = 'upper right', bbox_to_anchor  = (1.43, 0.97))

    plt.show()
    

slider= widgets.FloatLogSlider(
    value=2**-5, # actual slider value (not an exponent); equals the minimum 2**-5
    base=2,
    min=-5, # min exponent of base
    max=5, # max exponent of base
    step=0.2, # exponent step
    description='awgn_std',
    continuous_update= False
)

interactive_plot = interactive(plot_noisy_training_data, awgn_std=slider)
output = interactive_plot.children[-1]
interactive_plot


## Part b): Featurization- Lifting the training data
 


To apply linear regression to our learning problem we use the "lifting" trick: instead of working with the data points $\{x_i\}_{i=1}^n$ directly, we lift them into a high-dimensional space with a mapping ${\phi}$. Our points then turn into $n$ vectors $\{{\phi}(x_i)\}_{i=1}^n$.

In problem 7 of HW1 we show that under some rather general conditions any function $f$ can be approximated by either polynomials or trigonometric series. This motivates our choice of two feature families that we are going to use:

### Polynomial features
We consider the d-dimensional features given by the Vandermonde polynomials:
$${ \phi}(x) = [1, x, x^2, \dots, x^{d-1}].$$ The code in this cell lets you choose $k$ and look at the plot of $x^k$.

In [57]:
from numpy.polynomial.polynomial import polyvander
def featurize_vandermonde(x, d, normalize = False):
    A = polyvander(x, d-1)
    for d_ in range(A.shape[1]):
        if normalize:
            A[:,d_] *=  np.sqrt(2*d_+1)
    return A


def plot_poly_features(d): 
    n = 128  
    d_max = 20
    x_type = 'uniform_random'
    np.random.seed(7)
    x_true = generate_x(x_type = 'grid', n=1000)
    x_train = generate_x(x_type=x_type, n=n)
    phi_train = featurize_vandermonde(x_train, d_max)
    phi_true = featurize_vandermonde(x_true, d_max)

    plt.plot(x_true, phi_true[:,d], linewidth = 0.5)
    plt.scatter(x_train, phi_train[:,d], marker='o')
    plt.ylim([-1.2,1.2])
    plt.xlabel('x')
    plt.ylabel(r'$\phi(x)$')
    plt.show()


slider = widgets.IntSlider(
    value=0,
    min=0,
    max=10,
    step=1,
    description='Feature # k:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)


interactive_plot = interactive(plot_poly_features, d=slider)
output = interactive_plot.children[-1]
interactive_plot


### Fourier features
We consider the d-dimensional real Fourier features given by:
    $$\phi(x) = [1, \sin(\pi x), \cos(\pi x), \sin(2 \pi x), \cos(2\pi x), \dots,  \sin (r \pi x), \cos(r \pi x)],$$
    where $r = \frac{d-1}{2}$.
    
Note that by this convention we require $d$ to be an odd integer. The code in this cell lets you choose $k$ and look at the plot of the $k$-th coordinate of $\phi$ from above.

In [58]:
def featurize_fourier(x, d, normalize = False):
    assert (d-1) % 2 == 0, "d must be odd"
    max_r = int((d-1)/2)
    n = len(x)
    A = np.zeros((n, d))
    A[:,0] = 1
    for d_ in range(1,max_r+1):
        A[:,2*(d_-1)+1] =  np.sin(d_*x*np.pi)
        A[:,2*(d_-1)+2] =  np.cos(d_*x*np.pi)
    
    if normalize:
        A[:,0] *= (1/np.sqrt(2))
        A *= np.sqrt(2)
    return A

def plot_fourier_features(x_type,d): 
    n = 128  
    d_max = 21
    np.random.seed(7)
    x_true = generate_x(x_type = 'grid', n=1000)
    x_train = generate_x(x_type=x_type, n=n)
    phi_train = featurize_fourier(x_train, d_max)
    phi_true = featurize_fourier(x_true, d_max)
    
    plt.plot(x_true, phi_true[:,d], linewidth = 0.5)
    plt.scatter(x_train, phi_train[:,d], marker='o')
    plt.ylim([-1.2,1.2])
    plt.xlabel('x')
    plt.ylabel(r'$\phi(x)$')
    plt.show()


slider1 = widgets.RadioButtons(
    options=['uniform_random', 'grid'],
    description='x_type:',
    disabled=False
)
    
    
slider2 = widgets.IntSlider(
    value=0,
    min=0,
    max=20,
    step=1,
    description='Feature # k:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)

interactive_plot = interactive(plot_fourier_features, d=slider2, x_type=slider1)
output = interactive_plot.children[-1]
interactive_plot


In [59]:
def featurize(x, d, phi_type, normalize = False):
    function_map = {'polynomial':featurize_vandermonde, 'fourier':featurize_fourier}
    return function_map[phi_type](x,d,normalize)

**Sanity check:** answer the following questions:
1. How do we generate the data?

*solution:* First, we either take points $\{x_i\}_{i=1}^n$ from a grid on $[-1,1]$ or sample them from a uniform distribution. Then we generate values $y_i = f(x_i) + \varepsilon_i$, where $\varepsilon_i$ is Gaussian noise.
2. Why is our data a matrix?

*solution:* We turn the points $x_i$ into vectors $\phi(x_i)$. Stacking those vectors forms a matrix.
3. What does approximating a function by polynomials have to do with linear regression with polynomial features?

*solution:* Linear functions on $\mathbb{R}^d$ have the form $\ell({\bf z}) = {\bf v}^\top {\bf z}$, where ${\bf z}\in \mathbb{R}^d$ is the argument and ${\bf v} \in \mathbb{R}^d$ is the vector of coefficients. If we plug in ${\bf z} = \phi(x) = [1, x, x^2, \dots, x^{d-1}]$, we have $\ell({\bf z}) = \sum_{i=1}^d v_i x^{i-1}$, which is a polynomial in $x$. Thus, linear functions of the featurized points are exactly polynomials in the original points.
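
The identity in the last answer can be checked numerically. The sketch below uses a made-up coefficient vector `v` (an assumption chosen only for illustration) and compares the linear function of the lifted points with a direct evaluation of the polynomial.

```python
import numpy as np
from numpy.polynomial.polynomial import polyvander, polyval

# Hypothetical coefficient vector (d = 4), chosen only for illustration.
v = np.array([0.5, -1.0, 2.0, 3.0])
x = np.linspace(-1, 1, 7)

# Linear function of the lifted points: phi(x_i)^T v for every i.
linear_in_features = polyvander(x, len(v) - 1) @ v

# The same values from evaluating the polynomial sum_k v_k x^k directly.
polynomial_in_x = polyval(x, v)

print(np.allclose(linear_in_features, polynomial_in_x))
```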

## Part c): Conditioning of $\Phi^\top \Phi$

As we will see in part g, a very important property of the data matrix is how its singular values are distributed. In this part we are going to explore how our choice of features and sampling influences those singular values. However, before we start the experiments, let's look at problem 9b from HW1. 
1. Do problem 9b from HW1 (you can assume part a). What does the result of part b mean in terms of singular values of the data matrix for measurements on a grid and Fourier features?

*solution:* 

Problem 9b says that if the points come from the grid and we use Fourier features, the columns of the data matrix are orthogonal (and orthonormal after a suitable rescaling). All singular values of a matrix with orthonormal columns are equal to 1. Therefore, if we take points from the grid and choose Fourier features, all singular values of the resulting data matrix will be the same up to the scaling of the columns. This corresponds to the orange graph on the plot below being an almost horizontal line.
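
The orthogonality on the grid can be verified directly. The following sketch rebuilds unnormalized Fourier features for illustrative sizes `n = 64`, `d = 11` (both assumptions, any odd `d` well below `n` behaves the same way) and inspects the Gram matrix $\Phi^\top \Phi$.

```python
import numpy as np

n, d = 64, 11  # illustrative sizes; d is odd and smaller than n
x = np.linspace(-1, 1, n, endpoint=False)

# Unnormalized real Fourier features, same column order as in part b.
cols = [np.ones(n)]
for k in range(1, (d - 1) // 2 + 1):
    cols += [np.sin(k * np.pi * x), np.cos(k * np.pi * x)]
Phi = np.column_stack(cols)

gram = Phi.T @ Phi
off_diag = gram - np.diag(np.diag(gram))

print(np.abs(off_diag).max())  # numerically zero: the columns are orthogonal
print(np.diag(gram))           # squared column norms
```

The diagonal entries are $n$ for the constant column and $n/2$ for every sine and cosine column, so without normalization the eigenvalues of $\Phi^\top\Phi$ take only these two values.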

2. Can you guess how the singular values will behave when $x_i$ are sampled uniformly at random for polynomial and Fourier features? (Hint: on the one hand, uniform random sampling seems somewhat similar to making measurements on a grid. On the other hand, look at the plots of the features above. Polynomial features have small values and look similar. What does it mean in terms of singular values?)

Do the simulation below to check your intuition.

*solution:* As the hint suggests, if we sample from a uniform distribution but still use Fourier features, the singular values will still be mostly the same. However, if we use polynomial features, the resulting matrix has many small columns which results in many small singular values. 


This aligns well with the simulations below: the green graph starts close to the orange graph and only diverges when $d$ becomes large, while the blue graph is decreasing from the very beginning. 
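
To put a number on this comparison, here is a minimal sketch that computes the ratio of the largest to the smallest singular value for both feature families on the same random sample. The sizes `n = 200`, `d = 15`, the seed, and the helper `fourier_features` are assumptions made for this illustration only.

```python
import numpy as np
from numpy.polynomial.polynomial import polyvander

def fourier_features(x, d):
    """Unnormalized real Fourier features with the column order used in part b."""
    cols = [np.ones_like(x)]
    for k in range(1, (d - 1) // 2 + 1):
        cols += [np.sin(k * np.pi * x), np.cos(k * np.pi * x)]
    return np.column_stack(cols)

rng = np.random.default_rng(0)  # arbitrary seed
n, d = 200, 15
x = np.sort(rng.uniform(-1, 1, n))

s_poly = np.linalg.svd(polyvander(x, d - 1), compute_uv=False)
s_fourier = np.linalg.svd(fourier_features(x, d), compute_uv=False)

# Singular values are returned in descending order.
cond_poly = s_poly[0] / s_poly[-1]
cond_fourier = s_fourier[0] / s_fourier[-1]
print(cond_poly, cond_fourier)
```

The polynomial ratio should come out several orders of magnitude larger than the Fourier one, which is the numerical counterpart of the blue curve dropping quickly in the plot.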

In [60]:
def plot_eig_values(n, d, seed, lambda_ridge):
    np.random.seed(seed)
    x_type_phi_type_pairs = [('uniform_random', 'polynomial'), ('uniform_random', 'fourier'), ('grid', 'fourier')]
    
    colors = ['blue', 'green', 'orange']
    
    for k, (x_type, phi_type) in enumerate(x_type_phi_type_pairs):
        x_train = generate_x(x_type=x_type, n=n)
        phi_train = featurize(x_train, d, phi_type)
        # phi^T phi (+ lambda I) is symmetric, so eigvalsh returns real eigenvalues
        eig_vals = np.linalg.eigvalsh(phi_train.T @ phi_train + lambda_ridge*np.eye(d)) 
        

        
        eig_vals = np.sort(np.abs(eig_vals))[::-1]
        plt.plot(eig_vals, 'o-', c = colors[k], label = 'x_type: ' + str(x_type) + ', phi_type: ' + str(phi_type))
        
        
    plt.legend(bbox_to_anchor = (1.73, 0.97))
    plt.yscale('log')
    plt.ylim(1e-20, 1e4)
    plt.xlim([-1, d+1])

    plt.show()


In [61]:
seed = 1
lambda_ridge = 0


slider = widgets.IntSlider(
    value=11,
    min=1,
    max=65,
    step=2,
    description='d:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)
slider1 = widgets.IntSlider(
    value=65,
    min=65,
    max=200,
    step=2,
    description='n:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)


interactive_plot =interactive(plot_eig_values,n = slider1, d = slider, seed = fixed(seed),
                              lambda_ridge = fixed(lambda_ridge))
interactive_plot


## Part d): Linear Regression to learn the 1d-function in feature space

Now that we are done with setting up our data, we start  defining the tools that we will use for learning: ordinary least squares and ridge regression.




#### Least squares
To learn the function we perform linear regression in the lifted feature space, i.e. we learn a vector of coefficients $w \in \mathbb{R}^d$ that minimizes the least-squares loss

$$\ell(w) = \frac{1}{n} \| y - \Phi w \|_2^2,$$

where $\Phi$ is the $n \times d$ data matrix whose rows are the feature vectors $\phi(x_i)$.
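
As a point of reference for the loss above, the minimizer can also be computed with plain NumPy. This is only a sketch on a random stand-in feature matrix `Phi` (not the notebook's data); it compares a least-squares solver with the normal equations $(\Phi^\top\Phi) w = \Phi^\top y$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
Phi = rng.normal(size=(n, d))  # stand-in feature matrix, not the notebook's data
y = rng.normal(size=n)

# Minimizer of ||y - Phi w||^2 from a least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Same minimizer from the normal equations (valid when Phi has full column rank).
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

loss = np.mean((y - Phi @ w_lstsq) ** 2)  # (1/n) * ||y - Phi w||_2^2
print(np.allclose(w_lstsq, w_normal), loss)
```

The normal equations are convenient for analysis, but when $\Phi^\top\Phi$ is badly conditioned a solver that works on $\Phi$ directly is usually the safer choice.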

1. Add the code to the places marked with "TODO"

In [62]:
from sklearn.linear_model import LinearRegression
def solve_ls(phi, y):
    
    # The `normalize` argument has been removed from recent scikit-learn releases
    LR = LinearRegression(fit_intercept=False)

#     LR.fit(TODO)
#     coeffs = TODO
    LR.fit(phi, y)
    coeffs = LR.coef_

    
    
    loss = np.mean((y- phi@coeffs)**2)
    return coeffs, loss

#### Ridge

Ridge regression is a celebrated tool to combat noise. We add a regularizing penalty term to the least-squares objective and minimize the loss

$$\ell(w) = \frac{1}{n} (\| y - \Phi w \|_2^2+ \lambda \| w \|_2^2).$$

We will explore the effect of the regularization parameter $\lambda$ in part g.
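
For intuition, the ridge minimizer also has the closed form $w = (\Phi^\top\Phi + \lambda I)^{-1}\Phi^\top y$. The sketch below implements it with NumPy on random stand-in data; the helper name `ridge_closed_form` and the test values of $\lambda$ are assumptions made for illustration.

```python
import numpy as np

def ridge_closed_form(Phi, y, lam):
    """Solve (Phi^T Phi + lam * I) w = Phi^T y for w."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

rng = np.random.default_rng(0)
Phi = rng.normal(size=(40, 6))  # stand-in feature matrix
y = rng.normal(size=40)

w_unregularized = ridge_closed_form(Phi, y, 0.0)
w_regularized = ridge_closed_form(Phi, y, 10.0)

# The penalty shrinks the weight vector.
print(np.linalg.norm(w_unregularized), np.linalg.norm(w_regularized))
```

Setting `lam = 0` recovers ordinary least squares whenever $\Phi$ has full column rank.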

2. Add the code to the places marked with "TODO"


In [63]:
from sklearn.linear_model import Ridge

def solve_ridge(phi, y, lambda_ridge=0):
    
    # The `normalize` argument has been removed from recent scikit-learn releases
    Rdg = Ridge(fit_intercept=False, alpha = lambda_ridge)

#     Rdg.fit(TODO)
#     coeffs = TODO
    Rdg.fit(phi, y)
    coeffs = Rdg.coef_


    
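    # Ridge objective as defined above: (1/n) * (||y - phi w||^2 + lambda * ||w||^2)
    loss = (np.sum((y - phi@coeffs)**2) + lambda_ridge * np.sum(coeffs**2)) / len(y)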
    return coeffs, loss


## Part e): Comparing polynomial and Fourier features


In this part we compare our two families of features (Fourier and polynomial) in terms of approximating our chosen function. The sampling of data is chosen to be uniform random for both families. Note that we have no noise yet.

Do the following:

1. Choose $3$ different values of $n$: small, medium and large. For each such $n$, find the number of features needed to approximate our function well. Which feature family needs more features? Does it need many more?

*solution:* For every value of $n$ both feature families need roughly the same number of features (around 30 when $n$ is large). Polynomial features do a little bit worse for small $n$, mostly because they struggle to approximate the oscillating part ($x > 0.3$).

2. Compare the learned weights. Try to explain how what you see is related to the singular values of the data matrices. (You will be able to give a more rigorous answer to this question in part g.)

*solution:* The weights for polynomial features are orders of magnitude larger. This happens because the data matrix for polynomial features has small singular values, and therefore its pseudoinverse is large.

3. Fix $d=17$ and vary $n$. What do you see as $n$ grows? 

*solution:* When $n$ grows, our prediction converges to some fixed function. This happens because 17 features are not enough to closely approximate the true function, so we learn the best possible approximation instead.


In [64]:
def get_plot_data( phi_type, d, w, n_plot = 1000):
    x_plot= generate_x(x_type = 'grid', n=n_plot)
    y_plot_true = generate_y(x=x_plot)
    phi_plot = featurize(x_plot, d, phi_type)
    
    return x_plot, y_plot_true, phi_plot @ w

def gen_and_solve(n, d, x_type, phi_type, seed = 1, awgn_std = 0, lambda_ridge = 0):
    np.random.seed(seed)
    
    x_train = generate_x(x_type=x_type, n=n)
    phi_train = featurize(x_train, d, phi_type)
    y_train = generate_y(x=x_train)
    
    if awgn_std != 0:
        y_train = add_awgn_noise(y_train, awgn_std)
        
    if lambda_ridge == 0:
        w, loss = solve_ls(phi_train, y_train)

    else:
        w, loss = solve_ridge(phi_train, y_train, lambda_ridge)


    return x_train, y_train, w, loss


def visualize_(x_train, y_train,  phi_type, d, w, loss, n_plot = 1000, n_fit = 1000):
    x_plot, y_plot_true, y_plot_pred = get_plot_data( phi_type, d, w, n_plot)
    plt.plot(x_plot, y_plot_true, label = 'True function')
    plt.scatter(x_train, y_train, marker='o', s=20, label = 'Training samples')
    plt.plot(x_plot, y_plot_pred, 'o-', ms=2, label = 'Learned function')

    

    plt.title("Train loss:" + str("{:.2e}".format(loss)))
    plt.ylim([-1.5, 1.5])
    plt.xlabel('x')
    plt.ylabel('y')
    plt.legend(bbox_to_anchor = (1.03, 0.97))
    plt.show()
        
    # `use_line_collection` is no longer accepted by recent Matplotlib releases
    markerlines, stemlines,  baseline = plt.stem(np.arange(d), w, linefmt='b', markerfmt='o')
    plt.setp(stemlines, 'color', plt.getp(markerlines,'color'))
    plt.xlabel('feature #(k)')
    plt.ylabel('weight')
    plt.show()

def plot_true_and_predicted(n, n_plot, n_fit, x_type, phi_type, seed, awgn_std, lambda_ridge, d): 
    x_train, y_train, w, loss = gen_and_solve(n, d, x_type,  phi_type, awgn_std = awgn_std, lambda_ridge = lambda_ridge, seed = seed)
    visualize_(x_train, y_train,  phi_type, d, w, loss, n_plot , n_fit)   


In [65]:

def get_params2():
    n = 64
    n_plot = 1000
    n_fit = 10000
    x_type = 'uniform_random'
    seed = 1
    awgn_std = 0
    lambda_ridge = 0
    return n, n_plot, n_fit, x_type, seed, awgn_std, lambda_ridge



In [66]:
n, n_plot, n_fit, x_type, seed, awgn_std, lambda_ridge = get_params2()
slider1 = widgets.IntSlider(
    value=1,
    min=1,
    max=65,
    step=2,
    description='d:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)
slider2 = widgets.IntSlider(
    value=65,
    min=65,
    max=1000,
    step=2,
    description='n:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)
    

phi_type = 'polynomial'
print("Polynomial features")

interactive_plot =interactive(plot_true_and_predicted,n = slider2, n_plot = fixed(n_plot), n_fit = fixed(n_fit), 
                              x_type = fixed(x_type), phi_type = fixed(phi_type),seed = fixed(seed),
                              awgn_std= fixed(awgn_std), lambda_ridge = fixed(lambda_ridge), d = slider1)
interactive_plot


Polynomial features



In [67]:
n, n_plot, n_fit, x_type, seed, awgn_std, lambda_ridge = get_params2()

slider3 = widgets.IntSlider(
    value=1,
    min=1,
    max=65,
    step=2,
    description='d:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)
slider4 = widgets.IntSlider(
    value=65,
    min=65,
    max=1000,
    step=2,
    description='n:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)

phi_type = 'fourier'
print("Fourier features")

interactive_plot =interactive(plot_true_and_predicted,n = slider4, n_plot = fixed(n_plot), n_fit = fixed(n_fit), 
                              x_type = fixed(x_type), phi_type = fixed(phi_type),seed = fixed(seed),
                              awgn_std= fixed(awgn_std), lambda_ridge = fixed(lambda_ridge), d = slider3)
interactive_plot


Fourier features



## Part f): Effect of noise


 

Next we will see the effect of noise on our learning method. For this purpose we consider three combinations of input data and features:
1. uniform randomly sampled x, polynomial features
2. uniform randomly sampled x, Fourier features
3. evenly spaced x, Fourier features


Answer the following questions:
1. How does the influence of the noise change as we increase the number of features?

*solution:* The influence of the noise increases as the number of features increases.

2. How does the influence of the noise change as we increase the number of samples?

*solution:* The influence of the noise decreases as we increase the number of samples.

3. What is the shape of the dependence of the predicted function on $\sigma$ (e.g. quadratic, exponential, etc.)? (Hint: recall that the solution to least squares is given by the formula $w^* = (\Phi^\top \Phi)^{-1} \Phi^\top y$, and we generate $y$ by adding noise scaled by $\sigma$ to the true signal).

*solution:* The predicted function is linear in $w$, $w$ depends linearly on $y$, and $y$ depends affinely on $\sigma$. Thus, the predicted function depends affinely on $\sigma$: for a fixed noise realization, scaling $\sigma$ scales the error in the prediction linearly.
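
The affine dependence on $\sigma$ can be seen numerically when the noise realization is held fixed. The sketch below uses random stand-ins for the feature matrix, the clean signal and a unit-variance noise draw `z` (all assumptions for illustration), and checks that the weights move along a straight line as $\sigma$ changes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 6
Phi = rng.normal(size=(n, d))  # stand-in feature matrix
f_clean = rng.normal(size=n)   # stand-in for the noiseless values f(x_i)
z = rng.normal(size=n)         # one fixed unit-variance noise draw

def ls_weights(sigma):
    """Least-squares weights for y = f + sigma * z with the noise draw held fixed."""
    y = f_clean + sigma * z
    return np.linalg.lstsq(Phi, y, rcond=None)[0]

w0, w1, w2 = ls_weights(0.0), ls_weights(1.0), ls_weights(2.0)

# Affine in sigma: doubling sigma doubles the deviation from the noiseless weights.
print(np.allclose(w2 - w0, 2 * (w1 - w0)))
```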


In [68]:
def get_params3():
    n = 100
    n_plot = 1000
    n_fit = 10000
    seed = 1
    awgn_std = 1e-1
    lambda_ridge = 0
    return n, n_plot, n_fit, seed, awgn_std, lambda_ridge

In [69]:
n, n_plot, n_fit, seed, awgn_std, lambda_ridge = get_params3()
slider1 = widgets.RadioButtons(
    options=[11, 31, 51],
    description='d:',
    disabled=False
)
slider2= widgets.FloatSlider(
    value=0.,
    min=0., 
    max=1, 
    step=0.01,
    description=r'$\sigma$',
    continuous_update= False
)
slider3 = widgets.IntSlider(
    value=65,
    min=65,
    max=1000,
    step=2,
    description='n:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)
x_type = 'uniform_random'


phi_type = 'polynomial'
print("Uniform random x, Polynomial features")
interactive_plot =interactive(plot_true_and_predicted,n = slider3, n_plot = fixed(n_plot), n_fit = fixed(n_fit), 
                              x_type = fixed(x_type), phi_type = fixed(phi_type),seed = fixed(seed),
                              awgn_std= slider2, lambda_ridge = fixed(lambda_ridge), d = slider1)
interactive_plot


Uniform random x, Polynomial features



In [70]:
n, n_plot, n_fit, seed, awgn_std, lambda_ridge = get_params3()
slider4 = widgets.RadioButtons(
    options=[11, 31, 51],
    description='d:',
    disabled=False
)

slider5= widgets.FloatSlider(
    value=0.,
    min=0., 
    max=1, 
    step=0.01,
    description=r'$\sigma$',
    continuous_update= False
)
slider6 = widgets.IntSlider(
    value=65,
    min=65,
    max=1000,
    step=2,
    description='n:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)
x_type = 'uniform_random'

phi_type = 'fourier'
print("Uniform random x, Fourier features")

interactive_plot =interactive(plot_true_and_predicted,n = slider6, n_plot = fixed(n_plot), n_fit = fixed(n_fit), 
                              x_type = fixed(x_type), phi_type = fixed(phi_type),seed = fixed(seed),
                              awgn_std= slider5, lambda_ridge = fixed(lambda_ridge), d = slider4)
interactive_plot

Uniform random x, Fourier features



In [71]:
n, n_plot, n_fit, seed, awgn_std, lambda_ridge = get_params3()
slider7 = widgets.RadioButtons(
    options=[11, 31, 51],
    description='d:',
    disabled=False
)
slider8= widgets.FloatSlider(
    value=0.,
    min=0., 
    max=1, 
    step=0.01,
    description=r'$\sigma$',
    continuous_update= False
)
slider9 = widgets.IntSlider(
    value=65,
    min=65,
    max=1000,
    step=2,
    description='n:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)
x_type = 'grid'


phi_type = 'fourier'
print("Evenly spaced x, Fourier features")

interactive_plot =interactive(plot_true_and_predicted,n = slider9, n_plot = fixed(n_plot), n_fit = fixed(n_fit), 
                              x_type = fixed(x_type), phi_type = fixed(phi_type),seed = fixed(seed),
                              awgn_std= slider8, lambda_ridge = fixed(lambda_ridge), d = slider7)
interactive_plot

Evenly spaced x, Fourier features



## Part g) Ridge regression

Ridge regression, which we defined in part d, is a celebrated tool to combat noise. Later in this course we will develop a theoretical understanding of how adding a penalty helps with that, but for now we will just observe how it influences the solution to our learning problem, and how that influence is related to the spectral properties of the data matrix.



Before we start, let's think about the problem from the theoretical point of view: problem 6c from HW1 states that after a suitable change of coordinates, the $i$-th coordinate of  the solution to ridge regression can be obtained from the corresponding coordinate of $\bf{U}^\top \vec{y}$ by multiplication by $\frac{\sigma_i}{\sigma_i^2 + \lambda}$, where $\sigma_i$ is the $i$-th singular value of $\bf{X}$ (or zero if $i$ is greater than the rank of $\bf{X}$.)
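
The shrinkage factor described above can be checked against the closed-form ridge solution. This is a sketch on a random stand-in matrix `X` with an arbitrary $\lambda$ (both assumptions); the change of coordinates is taken to be the basis of right singular vectors from the SVD $X = U \Sigma V^\top$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 4
X = rng.normal(size=(n, d))  # stand-in data matrix
y = rng.normal(size=n)
lam = 0.5                    # arbitrary regularization strength

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Coordinates of the ridge solution in the basis of right singular vectors:
# each coordinate of U^T y is multiplied by sigma_i / (sigma_i^2 + lambda).
shrink = s / (s**2 + lam)
w_svd = Vt.T @ (shrink * (U.T @ y))

# Direct solution of (X^T X + lambda I) w = X^T y for comparison.
w_direct = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print(np.allclose(w_svd, w_direct))
```

For $\lambda = 0$ the factor reduces to $1/\sigma_i$, which is the pseudoinverse scaling of ordinary least squares.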

1. What happens if $\lambda = 0$ and $\sigma_i$ is small?

*solution:* In this case $\frac{\sigma_i}{\sigma_i^2 + \lambda} = \sigma_i^{-1}$ is a large number, and the corresponding coordinate of the weight vector will be a large multiple of the analogous coordinate of $\bf{U}^\top \vec{y}$. However, increasing $\lambda$ a little would make that factor much smaller (e.g. making $\lambda = \sigma_i$ would make it less than 1). 
2. Suppose $\sigma_i$ are all large. Will small $\lambda$ influence the performance?

*solution:* We can see that if $\sigma_i$ is large, then $\frac{\sigma_i}{\sigma_i^2 + \lambda}$ is not influenced much by small $\lambda$: decreasing $\lambda$ to zero would only make it larger by a factor of $1 + \lambda/\sigma_i^2$, which is close to 1 for small $\lambda$ and large $\sigma_i$. Thus, if all $\sigma_i$ are large, our predictions are not sensitive to small amounts of regularization. 

3. How do you think $\sigma_i$ will change as we add more data points (recall that we are in the regime where $n > d$)? How will it influence our choice of $\lambda$? (Hint: Think about the Frobenius norm. What happens to it when you add rows to a matrix? Now recall that the squared Frobenius norm is equal to the sum of squared singular values.)

*solution:* As we could see from the experiments in part c, adding data points increases the singular values. This happens because adding a row ${\bf x}^\top$ to a matrix ${\bf X}$ adds ${\bf x}{\bf x}^\top$ to the matrix ${\bf X}^\top{\bf X}$. Since the matrix ${\bf x}{\bf x}^\top$ is psd, adding it cannot decrease any eigenvalue of ${\bf X}^\top{\bf X}$, so the singular values of ${\bf X}$ can only grow. Larger singular values in turn mean that a larger $\lambda$ is needed to have a noticeable effect.
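
The effect of an extra data point on the singular values can be illustrated on a small random matrix. In the sketch below the data points are stored as rows of a stand-in matrix `X` (sizes and seed are assumptions); it also checks the Frobenius-norm identity from the hint.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
X = rng.normal(size=(20, d))     # stand-in data matrix, one data point per row
x_new = rng.normal(size=(1, d))  # one additional data point
X_plus = np.vstack([X, x_new])

# Singular values are returned in descending order, so they can be compared entrywise.
s_before = np.linalg.svd(X, compute_uv=False)
s_after = np.linalg.svd(X_plus, compute_uv=False)

print(np.all(s_after >= s_before - 1e-12))                 # no singular value decreases
print(np.isclose(np.sum(X_plus**2), np.sum(s_after**2)))   # squared Frobenius norm identity
```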



Now run the experiments below and answer the following questions:
1. How does the amount of regularization you need to apply depend on the number of samples?

*solution:* Adding samples makes the prediction less sensitive to regularization, therefore larger values of $\lambda$ are needed to make a difference. This aligns well with the intuition developed above: adding samples makes the singular values larger.

2. How does the amount of regularization you need to apply depend on the number of features?

*solution:* When we add features, our prediction becomes more sensitive to $\lambda$, therefore smaller values of $\lambda$ are needed. This aligns well with the intuition developed above: we saw in part c that adding features adds small singular values.

3. For which of the three experiments below is the learned function the most sensitive to $\lambda$? 

*solution:* In the first experiment (polynomial features) the prediction is the most sensitive to $\lambda$; in the last one (Fourier features, grid) it is the least sensitive. Once again, this aligns well with what we know about singular values: the data matrix for polynomial features has many small singular values, while for Fourier features all singular values are relatively large.

4. Do your observations align well with the intuition we developed above?

*solution:* They do, as discussed above.

In [72]:
def get_params4():
    n = 65
    n_plot = 1000
    n_fit = 10000
    seed = 1
    awgn_std = 3 * 1e-1
    return n, n_plot, n_fit, seed, awgn_std


In [73]:
n, n_plot, n_fit, seed, awgn_std = get_params4()

slider1 = widgets.RadioButtons(
    options=[11, 31, 51],
    description='d:',
    disabled=False
)

slider2= widgets.FloatLogSlider(
    value=2**-50,  # actual slider value (not an exponent)
    base=2,
    min=-50, 
    max=10,
    step=1, 
    description=r'$\lambda$',
    continuous_update= False
)

slider3 = widgets.RadioButtons(
    options=[65, 150, 500, 1000],
    description='n:',
    disabled=False
)

x_type = 'uniform_random'

phi_type = 'polynomial'
print("Uniform random x, Polynomial features")
interactive_plot =interactive(plot_true_and_predicted,n = slider3, n_plot = fixed(n_plot), n_fit = fixed(n_fit), 
                              x_type = fixed(x_type), phi_type = fixed(phi_type),seed = fixed(seed),
                              awgn_std= fixed(awgn_std), lambda_ridge = slider2, d = slider1)
interactive_plot


Uniform random x, Polynomial features



In [74]:
n, n_plot, n_fit, seed, awgn_std = get_params4()
slider4 = widgets.RadioButtons(
    options=[11, 31, 51],
    description='d:',
    disabled=False
)

slider5= widgets.FloatLogSlider(
    value=2**-50,  # actual slider value (not an exponent)
    base=2,
    min=-50,
    max=10, 
    step=1,
    description=r'$\lambda$',
    continuous_update= False
)

slider6 = widgets.RadioButtons(
    options=[65, 150, 500, 1000],
    description='n:',
    disabled=False
)

x_type = 'uniform_random'


phi_type = 'fourier'
print("Uniform random x, Fourier features")

interactive_plot =interactive(plot_true_and_predicted,n = slider6, n_plot = fixed(n_plot), n_fit = fixed(n_fit), 
                              x_type = fixed(x_type), phi_type = fixed(phi_type),seed = fixed(seed),
                              awgn_std= fixed(awgn_std), lambda_ridge = slider5, d = slider4)
interactive_plot

Uniform random x, Fourier features



In [75]:
n, n_plot, n_fit, seed, awgn_std = get_params4()
slider7 = widgets.RadioButtons(
    options=[11, 31, 51],
    description='d:',
    disabled=False
)

slider8= widgets.FloatLogSlider(
    value=2**-50,  # actual slider value (not an exponent)
    base=2,
    min=-50,
    max=10, 
    step=1,
    description=r'$\lambda$',
    continuous_update= False
)

slider9 = widgets.RadioButtons(
    options=[65, 150, 500, 1000],
    description='n:',
    disabled=False
)

x_type = 'grid'


phi_type = 'fourier'
print("Evenly spaced x, Fourier features")

interactive_plot =interactive(plot_true_and_predicted,n = slider9, n_plot = fixed(n_plot), n_fit = fixed(n_fit), 
                              x_type = fixed(x_type), phi_type = fixed(phi_type),seed = fixed(seed),
                              awgn_std= fixed(awgn_std), lambda_ridge = slider8, d = slider7)
interactive_plot

Evenly spaced x, Fourier features

