# Discussion 6

In this discussion section we study 1) PCA from the latent variable perspective, and 2) $\ell_1$ (LASSO) vs $\ell_2$ (ridge) regularization.

In [16]:
import numpy as np
import sklearn
from sklearn.decomposition import PCA
from ipywidgets import interactive
import ipywidgets as widgets
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso 

## Part (a): PCA from the Latent Variable Perspective ##

In this part, we study PCA from the latent variable perspective. 

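Suppose the observed features $\vec{X} \in \mathbb{R}^m$ originate from latent features $\vec{L} \in \mathbb{R}^{\ell}$, where $\ell \leq m$. Running PCA on many samples of $\vec{X}$ can then recover the latent features up to an invertible linear transformation (for example, a rotation or rescaling), provided the noise is small relative to the signal.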

In this example, we assume the latent variable is sampled from an ellipse by drawing its angle parameter uniformly. Specifically, let $$\vec{L'} = [2\sin(\theta), \cos(\theta)]^T \in \mathbb{R}^2$$ where $\theta$ is sampled from the uniform distribution $U(0, 2\pi)$. We then rotate $\vec{L'}$ with a random rotation matrix to obtain $\vec{L}$.

The observed $\vec{X}$ is a linear transformation of $\vec{L}$ plus i.i.d. noise: $$\vec{X} = W\vec{L} + \vec{N}$$ where $W \in \mathbb{R}^{m \times \ell}$ and the noise $\vec{N} \sim \mathcal{N}(\vec{0}, \sigma^2 I)$.
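
The observation model above can be probed numerically. The following is a minimal sketch, not the notebook's own data generator: it assumes $W$ has orthonormal columns (built here with a QR factorization as a stand-in), skips the random rotation of the ellipse, and uses illustrative values for `n`, `m`, and `sigma`. Under these assumptions the sample covariance of $\vec{X}$ should have two large eigenvalues, roughly $2 + \sigma^2$ and $0.5 + \sigma^2$, and the remaining ones close to $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, ell, sigma = 5000, 10, 2, 0.3   # illustrative values, not the notebook's

# Latent points on an ellipse (the rotation step is omitted for brevity).
theta = rng.uniform(0.0, 2 * np.pi, size=n)
L = np.column_stack((2 * np.sin(theta), np.cos(theta)))   # shape (n, ell)

# Assumption: W has orthonormal columns, built here via a reduced QR factorization.
W, _ = np.linalg.qr(rng.normal(size=(m, ell)))            # shape (m, ell)

# Row-wise version of X = W L + N.
X = L @ W.T + rng.normal(0.0, sigma, size=(n, m))

# Eigenvalues of the sample covariance, largest first.
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
print(np.round(eigvals, 3))
```

The gap between the second and third eigenvalue is what lets PCA separate the latent directions from the noise directions.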

We observe $n=1000$ data points of $\vec{X}$. Let's generate the data first.

In [17]:
def generate_l(N):
    theta = np.random.uniform(low=0.0, high=2*np.pi, size=N)
    L = np.vstack((2*np.sin(theta), np.cos(theta))).transpose()
    return L@orth_basis(2,2)
def generate_X(L, dim_m, sigma_n = 0.5):
    N, dim_l  = L.shape
    noise = np.random.normal(0, sigma_n, (N, dim_m))
    #W = np.random.rand(dim_l, dim_m)
    W = orth_basis(dim_m, dim_l)
    X = L@np.transpose(W) + noise
    return X, X@W

Then let's fit a PCA model on $X$ and project the data onto the first $\ell=2$ principal components. Note that we fix $\ell=2$ for convenience of visualization.
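
One way to check that the projection captures the latent subspace is to compare the span of the PCA components with the span of $W$. The sketch below is illustrative only: `W` is again an assumed orthonormal mixing matrix built with QR, and the sizes are arbitrary. It computes the cosines of the principal angles between the two subspaces; values near 1 suggest that they nearly coincide.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
n, m, ell, sigma = 2000, 10, 2, 0.3   # illustrative values

theta = rng.uniform(0.0, 2 * np.pi, size=n)
L = np.column_stack((2 * np.sin(theta), np.cos(theta)))
W, _ = np.linalg.qr(rng.normal(size=(m, ell)))   # assumed orthonormal mixing matrix
X = L @ W.T + rng.normal(0.0, sigma, size=(n, m))

pca = PCA(n_components=ell)
L_hat = pca.fit_transform(X)                     # projected data, shape (n, ell)

# Singular values of W^T V are the cosines of the principal angles
# between span(W) and the span of the PCA components V.
cosines = np.linalg.svd(W.T @ pca.components_.T, compute_uv=False)
print("explained variance ratio:", pca.explained_variance_ratio_)
print("principal-angle cosines:", cosines)
```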

In [18]:
def gen_data_and_fit_PCA(dim_m, sigma_n, normalize=True):
    ## generate data
    dim_l = 2
    N = 1000

    L = generate_l(N)
    X, X_hat = generate_X(L, dim_m, sigma_n)
    ## fit PCA
    pca = PCA(n_components=dim_l)
    # fit_transform fits the model and projects X in one call, so a separate fit is unnecessary
    L_hat = pca.fit_transform(X)
    L_rand = random_project_data(X, dim_l)
    if normalize:
        L = normalize_l(L)
        L_hat = normalize_l(L_hat)
        L_rand = normalize_l(L_rand)
        X_hat = normalize_l(X_hat)
    return L, L_hat, L_rand, X_hat
def orth_basis(dim, dim_l):
    ## This function creates orthogonal basis from random projection
    random_state = np.random
    H = np.eye(dim)
    D = np.ones((dim,))
    for n in range(1, dim):
        x = random_state.normal(size=(dim-n+1,))
        D[n-1] = np.sign(x[0])
        x[0] -= D[n-1]*np.sqrt((x*x).sum())
        # Householder transformation
        Hx = (np.eye(dim-n+1) - 2.*np.outer(x, x)/(x*x).sum())
        mat = np.eye(dim)
        mat[n-1:, n-1:] = Hx
        H = np.dot(H, mat)
    # Fix the last sign such that the determinant is 1
    D[-1] = (-1)**(1-(dim % 2))*D.prod()
    # Equivalent to np.dot(np.diag(D), H) but faster, apparently
    H = (D*H.T).T
    return H[:, :dim_l]
def random_project_data(X, dim_l):
    dim_m = X.shape[1]
    W = orth_basis(dim_m, dim_l)
    #W = np.random.rand(dim_m, dim_l)
    return X@W

Let's visualize the ground-truth latent variable $L$ and recovered $\hat{L}$. Note we need to normalize them first to eliminate scaling issues.
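
As a small illustration of the normalization step, the sketch below divides each column of a toy matrix by its $\ell_2$ norm. The helper name `normalize_columns` is hypothetical and the matrix is made up. Note that this removes differences in scale only; it does not undo a rotation or a sign flip.

```python
import numpy as np

def normalize_columns(A):
    # Hypothetical helper: divide each column by its l2 norm.
    return A / np.linalg.norm(A, ord=2, axis=0, keepdims=True)

A = np.array([[3.0, 1.0],
              [4.0, 1.0]])
A_unit = normalize_columns(A)
print(A_unit)
print(np.linalg.norm(A_unit, axis=0))   # each column now has unit norm
```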

In [19]:
def normalize_l(L):
    return L/np.linalg.norm(L, ord = 2,axis=0, keepdims=True)
def plot_latent(L, L_hat, L_rand, X_hat):
    fig, (ax1, ax2, ax3, ax4) = plt.subplots(1, 4)
    fig.set_size_inches(26.5, 6.5)
    fig.suptitle('Latent Variable: Clean Ground-Truth (1st) vs Noisy Ground-Truth (2nd) vs PCA Recovered (3rd) vs Random Projection (4th)', fontsize=25)
    ax1.plot(L[:, 0], L[:,1], ".")
    ax2.plot(X_hat[:, 0], X_hat[:,1], ".")
    ax3.plot(L_hat[:, 0], L_hat[:,1], ".")
    ax4.plot(L_rand[:, 0], L_rand[:,1], ".")
def generate_dim_m_widget():
    return widgets.IntSlider(
        value=50,
        min=5, 
        max=100, 
        step=5,
        description='dim_m',
        continuous_update= False)
def generate_sigma_n_widget():
    return widgets.FloatSlider(
        value=0.3,
        min=0.01, 
        max=1, 
        step=0.01,
        description='noise level',
        continuous_update= False)
def visualize(dim_m, sigma_n):
    L, L_hat, L_rand, X_hat = gen_data_and_fit_PCA(dim_m, sigma_n)
    plot_latent(L, L_hat, L_rand, X_hat)

In [20]:
interactive_plot = interactive(visualize,
                               dim_m=generate_dim_m_widget(),
                               sigma_n=generate_sigma_n_widget()  
                              )
interactive_plot

1. **Change $m$ (dim_m). How does the difference between the ground-truth and recovered latent variables change?**
2. **Change the noise level. What do you observe for small and large noise values?**
3. **Why is the latent variable recovered by PCA aligned with the x and y axes?**
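
One way to probe question 3 empirically is to compare the covariance of the data before and after the PCA projection. The sketch below is a hint rather than a full answer: it rotates an ellipse by an arbitrary angle (40 degrees, a hypothetical choice) and prints both covariance matrices.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
theta = rng.uniform(0.0, 2 * np.pi, size=1000)
ellipse = np.column_stack((2 * np.sin(theta), np.cos(theta)))

# Rotate the ellipse by an arbitrary angle (hypothetical choice: 40 degrees).
phi = np.deg2rad(40.0)
R = np.array([[np.cos(phi), -np.sin(phi)],
              [np.sin(phi),  np.cos(phi)]])
rotated = ellipse @ R.T

scores = PCA(n_components=2).fit_transform(rotated)

cov_rotated = np.cov(rotated, rowvar=False)
cov_scores = np.cov(scores, rowvar=False)
print(np.round(cov_rotated, 3))   # off-diagonal entries are clearly non-zero
print(np.round(cov_scores, 3))    # off-diagonal entries are numerically zero
```

The projected coordinates are uncorrelated, which is worth keeping in mind when interpreting the third panel of the plot.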

## Part (b): $\ell_1$ (LASSO) vs $\ell_2$ (ridge) Regularization ##

In this part, we compare LASSO and ridge regression. The setup differs slightly from the previous part: for data $\vec{X} \in \mathbb{R}^m$ and labels $\vec{y}$, the labels come from a latent variable $\vec{L} \in \mathbb{R}^l$, $l \leq m$, while $\vec{X}$ consists of $\vec{L}$ together with irrelevant features $\vec{R} \in \mathbb{R}^{m-l}$. We aim to use LASSO and ridge regression to recover the latent variable $\vec{L}$, that is, to identify which features of $\vec{X}$ actually drive $\vec{y}$.

Specifically, let $m = 20$ and let the number of sampled data points be $N = 1000$. Assume $\vec{L} \sim \mathcal{N}(0, 0.5), \vec{R} \sim \mathcal{N}(0, 0.5)$. Data $X$ is a combination of $\vec{L}, \vec{R}$ and noise $\vec{n_x} \sim \mathcal{N}(0, 0.1)$: $$X = [\vec{L}, \vec{R}]^T + \vec{n_x}$$
Note that we set the first $l$ dimensions of $\vec{X}$ to come from the latent variable for simplicity of visualization. The interactive widgets below let you vary $m$, $N$, and the noise level.

The label $\vec{y}$ is a linear transformation of $\vec{L}$ plus noise $\vec{n_y} \sim \mathcal{N}(0, 0.1)$: $$\vec{y} = W\vec{L} + \vec{n_y}$$

We will run LASSO and ridge regression on $(\vec{X}, \vec{y})$ and compare how well they recover the coefficients.
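
Before running the full experiment, the qualitative difference between the two penalties can be seen on a toy problem. The sketch below uses made-up ground-truth weights, omits the feature noise $\vec{n_x}$, and picks the regularization strengths arbitrarily. It counts how many weights on the irrelevant features are exactly zero for each model.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
N, m, ell = 200, 20, 3                    # illustrative sizes

X = rng.normal(0.0, 0.5, size=(N, m))
w_true = np.zeros(m)
w_true[:ell] = [1.0, -0.8, 0.6]           # hypothetical ground-truth weights
y = X @ w_true + rng.normal(0.0, 0.05, size=N)

lasso = Lasso(alpha=0.05).fit(X, y)       # l1 penalty
ridge = Ridge(alpha=0.01).fit(X, y)       # l2 penalty

n_zero_lasso = int(np.sum(lasso.coef_[ell:] == 0))
n_zero_ridge = int(np.sum(ridge.coef_[ell:] == 0))
print("exact zeros among irrelevant weights (LASSO):", n_zero_lasso)
print("exact zeros among irrelevant weights (ridge):", n_zero_ridge)
```

In this setup LASSO typically zeroes out the irrelevant weights, while ridge leaves them small but non-zero.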

Let's generate the data first.

In [21]:
def gen_data(dim_l, dim_m, N = 10, sigma_l = 0.5, sigma_m = 0.5, sigma_n = 0.05):
    dim_y = 1
    
    noise_x = np.random.normal(0, sigma_n, (N, dim_m))
    noise_y = np.random.normal(0, sigma_n, (N, dim_y))  # zero-mean label noise, matching the model above
    
    L = np.random.normal(0, sigma_l, (N, dim_l)) #generate_l(N)
    R = np.random.normal(0, sigma_m, (N, dim_m - dim_l))
    
    X = np.hstack((L, R)) + noise_x

    W = np.random.rand(dim_l, dim_y)
    y = L@W + noise_y
    return X, y, W

To evaluate how well each method recovers the weights, we calculate the $\ell_2$ norm of the difference between the learned coefficients and the ground-truth weights.
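
As a worked example of this metric, the sketch below zero-pads a ground-truth weight vector of length $l$ to the length $m$ of the learned coefficients and then takes the $\ell_2$ norm of the difference. The helper name `coef_distance` and the numbers are hypothetical.

```python
import numpy as np

def coef_distance(coef, coef_gt):
    # Hypothetical helper: zero-pad the ground truth to the length of coef,
    # then return the l2 norm of the difference.
    coef = np.ravel(coef).astype(float)
    coef_gt = np.ravel(coef_gt)
    padded = np.zeros_like(coef)
    padded[:coef_gt.shape[0]] = coef_gt
    return np.linalg.norm(coef - padded)

learned = np.array([0.9, 0.1, 0.0, 0.0])   # weights over m = 4 features
truth = np.array([1.0])                    # weights over l = 1 latent feature
print(coef_distance(learned, truth))       # sqrt(0.1**2 + 0.1**2), about 0.1414
```

Weights placed on irrelevant features count toward the distance just like errors on the relevant ones.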

Let's implement the function.

In [22]:
def pad_coef_gt(coef, coef_gt):
    # Flatten both arrays, then zero-pad the ground truth to the length of coef
    coef, coef_gt = np.squeeze(coef), np.squeeze(coef_gt)
    coef_gt_ = np.zeros(coef.shape)
    coef_gt_[:coef_gt.shape[0]] = coef_gt
    return coef_gt_
def diff(coef, coef_gt):
    coef_gt = pad_coef_gt(coef, coef_gt)
    coef, coef_gt = np.squeeze(coef), np.squeeze(coef_gt)
    return np.linalg.norm(coef - coef_gt)

Another model we consider is debiased LASSO. LASSO can be viewed as a feature selection (dimension reduction) model, since the weights of irrelevant features are set to 0. However, the $\ell_1$ penalty also shrinks the weights of the relevant features toward 0. We can therefore first use LASSO to select the features with non-zero weights, and then perform ordinary least squares (OLS) on the selected features and the corresponding labels. We refer to this as <em>debiased LASSO</em>.
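
The two-step procedure can be sketched on a toy problem as follows. This is an illustration under stated assumptions, not the notebook's implementation: the ground-truth weights, the sizes, the LASSO weight `alpha=0.05`, and the selection threshold `1e-3` are all chosen for the example, and the feature noise is omitted.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(4)
N, m, ell = 200, 20, 3                    # illustrative sizes

X = rng.normal(0.0, 0.5, size=(N, m))
w_true = np.zeros(m)
w_true[:ell] = [1.0, -0.8, 0.6]           # hypothetical ground-truth weights
y = X @ w_true + rng.normal(0.0, 0.05, size=N)

# Step 1: LASSO selects a support (weights above a small threshold).
lasso = Lasso(alpha=0.05).fit(X, y)
support = np.flatnonzero(np.abs(lasso.coef_) > 1e-3)

# Step 2: refit ordinary least squares on the selected columns only.
ols = LinearRegression().fit(X[:, support], y)
w_debiased = np.zeros(m)
w_debiased[support] = ols.coef_

err_lasso = np.linalg.norm(lasso.coef_ - w_true)
err_debiased = np.linalg.norm(w_debiased - w_true)
print("selected features:", support)
print("LASSO error:", err_lasso, " debiased error:", err_debiased)
```

When the support is selected correctly, the OLS refit removes the shrinkage that the $\ell_1$ penalty applies to the relevant weights. If LASSO selects the wrong features, the refit inherits that mistake.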

Let's implement the function.

In [23]:
def debiased_lasso(lassocoef, X, y, epsilon=1e-3):
    ind = np.where(np.abs(lassocoef)>epsilon)[0]
    OLSModel = LinearRegression()  # plain OLS; scikit-learn advises against using Ridge with alpha=0
    OLSModel.fit(X[:, ind], y)
    
    #pad coef
    coef_ = np.zeros(X.shape[1])
    coef_[ind] = OLSModel.coef_
    
    return coef_ #OLSModel.coef_

Now let's run LASSO, ridge regression, and debiased LASSO on $(\vec{X}, \vec{y})$ and plot the ground-truth coefficients alongside those of each model. We will also compute the $\ell_2$ distance between each model's learned weights and the ground-truth weights.

In [54]:
def plot_coef(W, ridgecoef, lassocoef, dlasso_coef):
    ridgecoef = np.squeeze(ridgecoef)
    lassocoef = np.squeeze(lassocoef)
    
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(13.5, 3.5)
    fig.suptitle('Coefficient', fontsize=25)
    #ax1.xticks(np.arange(0, len(ridgecoef), step=1))
    ax1.plot(np.arange(len(W)), np.squeeze(W), "r^", label="Ground truth")
    ax1.plot(np.arange(len(ridgecoef)), np.squeeze(ridgecoef), "bx", label="Ridge")
    ax1.plot(np.arange(len(lassocoef)), np.squeeze(lassocoef), "g.", label="LASSO")
    ax1.legend()
    
    ax2.plot(np.arange(len(W)), np.squeeze(W), "r^", label="Ground truth")
    ax2.plot(np.arange(len(lassocoef)), np.squeeze(lassocoef), "g.", label="LASSO")
    ax2.plot(np.arange(len(dlasso_coef)), np.squeeze(dlasso_coef), "m*", label="Debiased LASSO")
    ax2.legend()
    #ax1.show()
    
def generate_dim_l_widget():
    return widgets.IntSlider(
        value=2,
        min=2, 
        max=19, 
        step=1,
        description='dim_l',
        continuous_update= False)
def generate_dim_m_widget():
    return widgets.IntSlider(
        value=40,
        min=10, 
        max=200, 
        step=5,
        description='dim_m',
        continuous_update= False)
def generate_N_widget():
    return widgets.IntSlider(
        value=30,
        min=10, 
        max=200, 
        step=5,
        description='N',
        continuous_update= False)
def generate_sigma_n_widget():
    return widgets.FloatSlider(
        value=0.05,
        min=0.01, 
        max=0.5, 
        step=0.01,
        description='noise level',
        continuous_update= False)
def generate_lassoweight_widget():
    return widgets.FloatSlider(
        value=0.01,
        min=0.01, 
        max=0.1, 
        step=0.01,
        description='Lasso weight',
        continuous_update= False)

def visualize_ridge_lasso(dim_l,dim_m, N, sigma_n, lassoweight=0.01):
    X, y, W = gen_data(dim_l, dim_m, N, sigma_n = sigma_n)
    lassoModel = Lasso(alpha = lassoweight) 
    lassoModel.fit(X, y) 

    ridgeModel = Ridge(alpha = 0.01) 
    ridgeModel.fit(X, y)
    
    dlasso_coef = debiased_lasso(lassoModel.coef_, X, y)

    W = pad_coef_gt(ridgeModel.coef_, W)
    
    GT = np.zeros(dim_m)
    GT[range(dim_l)] = np.array([1]*dim_l)

    plot_coef(W, ridgeModel.coef_, lassoModel.coef_, dlasso_coef)
    print("l2 norm difference:")
    print("Ridge: ", diff(ridgeModel.coef_, W ) )
    print("LASSO:", diff(lassoModel.coef_, W ) )
    print("Debiased LASSO:", diff(dlasso_coef, W ) )
    
    #print("Ridge: ", FAR(ridgeModel.coef_, GT ) )
    #print("LASSO:", FAR(lassoModel.coef_, GT ) )

In [55]:
interactive_plot = interactive(visualize_ridge_lasso,
                               dim_l=generate_dim_l_widget(),
                               dim_m=generate_dim_m_widget(),
                               N = generate_N_widget(),
                               sigma_n=generate_sigma_n_widget(),
                               lassoweight=generate_lassoweight_widget()
                              )
interactive_plot

**Look at the left figure:**

1. **Change the dimension of the latent variable $L$ (dim_l) and compare LASSO and ridge. Which model's learned coefficients are closer to the ground truth?**
2. **Change the feature dimension $m$ (dim_m), the number of points $N$, and the noise level. What do you observe? (Please change only one variable at a time.)**
3. **Change the LASSO weight $\lambda$ and find the best hyperparameter.**

**Look at the right figure:**

4. **Compare LASSO and debiased LASSO. Discuss which model performs better under specific circumstances (e.g., when the feature dimension is larger than the number of data points, when there is a lot of noise, etc.).**

Congrats! You have finished the notebook part. 

To sum up, we studied PCA from the latent variable perspective and compared LASSO and ridge regression.

As you may have noticed, PCA can be considered an <em>unsupervised learning</em> method, as it works on the data $X$ alone and reduces its dimension, while ridge regression can be considered a <em>supervised learning</em> method, as it requires both data $X$ and labels $y$. Is there any connection between PCA and ridge regression? As we showed with debiased LASSO, $\ell_1$ regularization can also be considered a dimension reduction tool. Can PCA perform a similar job?

Please return to the worksheet to continue.