# Homework 6 (Due 5/24)

## Name:

## ID:

## Instructions
Run every cell (in the menu, select Cell, then click Run All), export the notebook as a PDF, and submit the PDF to Gradescope.

To export as a PDF, use either of the following methods: (1) File -> Download as -> PDF, or (2) print to PDF from the browser.

**Q1**

In this problem, we show that ridge regression can reduce the variance of the model using a synthetic dataset. This problem is similar to this [example](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols_ridge_variance.html#sphx-glr-auto-examples-linear-model-plot-ols-ridge-variance-py).

(1) Suppose our dataset is generated by the following model:

$$ y = 1 + x + \epsilon $$

Generate 1000 samples of $x$ from a uniform distribution between 0 and 1. Generate $\epsilon$ from a normal distribution with mean 0 and standard deviation 1. Generate $y$ using the above model.




In [1]:
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

N = 1000
np.random.seed(0)  # For reproducibility
x = np.random.uniform(0, 1, N)
epsilon = np.random.normal(0, 1, N)
y = 1 + x + epsilon


(2) We can treat the previous dataset as the whole population. In practice, we observe only a few samples from the population.

Repeat the following experiment 100 times:

i. Randomly choose 3 samples from the dataset (we can use ``np.random.choice``)

ii. Fit a linear regression model and collect the slope and intercept of the model.

iii. Fit a ridge regression model with $\alpha=0.01$ and collect the slope and intercept of the model.


After this experiment, we have 100 slopes and 100 intercepts from the linear regression models and 100 slopes and 100 intercepts from the ridge regression models.

Compute and compare the mean and standard deviation of the slopes and the intercepts from the linear regression models and the ridge regression models.




In [2]:
lm = LinearRegression()
ridge = Ridge(alpha=0.01)

# Step 2: Repeat experiment 100 times
lm_slopes = []
lm_intercepts = []

ridge_slopes = []
ridge_intercepts = []

for _ in range(100):
    # Randomly select 3 samples without replacement
    idx = np.random.choice(N, size=3, replace=False)
    x_samples = x[idx].reshape(-1, 1)  # Reshape for sklearn
    y_samples = y[idx]

    # Fit both models on the same samples
    lm.fit(x_samples, y_samples)
    ridge.fit(x_samples, y_samples)

    # Collect slope and intercept for linear regression
    lm_slopes.append(lm.coef_[0])
    lm_intercepts.append(lm.intercept_)

    # Collect slope and intercept for Ridge
    ridge_slopes.append(ridge.coef_[0])
    ridge_intercepts.append(ridge.intercept_)

# Step 3: Compute mean and standard deviation
mean_slope = np.mean(lm_slopes)
std_slope = np.std(lm_slopes)

mean_intercept = np.mean(lm_intercepts)
std_intercept = np.std(lm_intercepts)

mean_slope_ridge = np.mean(ridge_slopes)
std_slope_ridge = np.std(ridge_slopes)

mean_intercept_ridge = np.mean(ridge_intercepts)
std_intercept_ridge = np.std(ridge_intercepts)

print("Linear Regression")
print(f"Slope: {mean_slope:.2f} +/- {std_slope:.2f}")
print(f"Intercept: {mean_intercept:.2f} +/- {std_intercept:.2f}")

print("Ridge Regression")
print(f"Slope: {mean_slope_ridge:.2f} +/- {std_slope_ridge:.2f}")
print(f"Intercept: {mean_intercept_ridge:.2f} +/- {std_intercept_ridge:.2f}")


Linear Regression
Slope: 1.44 +/- 4.58
Intercept: 0.77 +/- 2.80
Ridge Regression
Slope: 1.01 +/- 2.76
Intercept: 1.05 +/- 1.64
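
The smaller spread of the ridge estimates above can be traced to the closed-form slope in one dimension. The following is a minimal sketch in plain Python, assuming the penalty applies to the slope only and the intercept is left unpenalized (which is how `Ridge` with its default `fit_intercept=True` is understood to behave). The helper `fit_line` and the three sample points are hypothetical and chosen only for illustration.

```python
def fit_line(xs, ys, alpha=0.0):
    """Closed-form 1-D fit with an unpenalized intercept (illustrative helper).

    alpha = 0 gives ordinary least squares: slope = Sxy / Sxx.
    alpha > 0 gives the ridge slope:        slope = Sxy / (Sxx + alpha).
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in xs)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    slope = s_xy / (s_xx + alpha)
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical draw of 3 points whose x values are nearly identical,
# so Sxx is tiny and the OLS slope is dominated by the noise in y.
xs = [0.50, 0.52, 0.51]
ys = [1.2, 2.9, 0.4]

ols_slope, ols_intercept = fit_line(xs, ys)
ridge_slope, ridge_intercept = fit_line(xs, ys, alpha=0.01)

print(f"OLS slope:   {ols_slope:.2f}")
print(f"Ridge slope: {ridge_slope:.2f}")
```

With only 3 samples, $S_{xx}$ can be close to zero, and dividing by it makes the OLS slope swing wildly between draws. Adding $\alpha$ to the denominator bounds the slope, which lowers its variance at the cost of some bias toward zero. In this hypothetical draw the OLS slope is about 85 while the ridge slope is about 1.67.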


**Q2** 

Sometimes the number of features is much larger than the number of samples. Such a dataset is called high-dimensional.

In this problem, we compare Lasso and Ridge regression on a synthetic high-dimensional dataset with n = 20 and p = 100.

Each feature $X_0, \dots, X_{99}$ is generated from a standard normal distribution (mean 0, standard deviation 1).

The true model is

$$ y = 3X_0 - 2X_1 + 5X_2 $$

That is, only a small number of features are actually relevant to the target variable $y$. 
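
To build intuition for how the two penalties treat irrelevant features, here is a minimal sketch under an idealized orthonormal design, which is an assumption made only for illustration and does not hold for the dataset in this problem. In that setting, an L1 penalty acts on each unpenalized estimate by soft-thresholding, while an L2 penalty rescales it. The names `soft_threshold`, `ridge_shrink`, `lam`, and the example estimates are hypothetical, and `lam` is a generic penalty strength rather than scikit-learn's `alpha`.

```python
def soft_threshold(b, lam):
    """L1-style shrinkage: move b toward zero by lam, and clip to exactly zero."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0


def ridge_shrink(b, lam):
    """L2-style shrinkage: rescale b toward zero; never exactly zero unless b is."""
    return b / (1.0 + lam)


# Hypothetical unpenalized estimates: three large ones and two small, noisy ones
estimates = [3.0, -2.0, 5.0, 0.05, -0.08]
lam = 0.1

lasso_like = [soft_threshold(b, lam) for b in estimates]
ridge_like = [ridge_shrink(b, lam) for b in estimates]

print("L1-style:", lasso_like)
print("L2-style:", ridge_like)
```

In this sketch the two small estimates become exactly zero under the L1-style rule but stay nonzero under the L2-style rule, which is the qualitative difference to look for when comparing the fitted Lasso and Ridge coefficients.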


In [3]:
# DO NOT modify this cell
# Generate synthetic high-dimensional data
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error

np.random.seed(0)
n = 20  # number of observations
p = 100  # number of features
X = np.random.randn(n, p)
true_coef = np.concatenate([np.array([3, -2, 5]), np.zeros(p - 3)])
y = np.dot(X, true_coef)

Fit a Lasso and a Ridge regression model to this dataset without the intercept term and use $\alpha=0.1$.

(1) Compute and compare the mean squared error of the two models.

(2) Collect and compare the coefficients of the two models.

(3) Suppose we call feature $X_i$ relevant if its coefficient satisfies $|\beta_i| > 0.1$. What are the indices of the relevant features identified by the Lasso model and the Ridge model?




In [4]:
threshold = 0.1  # relevance cutoff from part (3)

# Lasso Regression
lasso = Lasso(alpha=0.1, fit_intercept=False)
lasso.fit(X, y)
lasso_coef = lasso.coef_

# Find the indices where |coefficient| > threshold
lasso_relevant_features = np.where(np.abs(lasso_coef) > threshold)[0]

# Ridge Regression
ridge = Ridge(alpha=0.1, fit_intercept=False)
ridge.fit(X, y)
ridge_coef = ridge.coef_

# Find the indices where |coefficient| > threshold
ridge_relevant_features = np.where(np.abs(ridge_coef) > threshold)[0]


# Results Comparison
# print("True Coefficients:", true_coef)  # Uncomment to compare against the true coefficients
print("Lasso Estimated Coefficients:", lasso_coef)
print("Ridge Estimated Coefficients:", ridge_coef)

print("Lasso Relevant Features:", lasso_relevant_features)
print("Ridge Relevant Features:", ridge_relevant_features)

print("Lasso Mean Squared Error:", mean_squared_error(y, lasso.predict(X)))
print("Ridge Mean Squared Error:", mean_squared_error(y, ridge.predict(X)))


Lasso Estimated Coefficients: [ 2.8896224  -1.86204855  4.90596724  0.         -0.         -0.
 -0.          0.         -0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.         -0.          0.         -0.          0.
  0.         -0.          0.         -0.          0.          0.
 -0.          0.         -0.          0.         -0.         -0.
  0.         -0.          0.         -0.01705282 -0.         -0.
 -0.          0.          0.         -0.          0.         -0.
 -0.         -0.         -0.          0.         -0.         -0.
 -0.          0.          0.          0.         -0.          0.
 -0.         -0.         -0.         -0.         -0.          0.
 -0.         -0.         -0.          0.         -0.          0.
  0.         -0.          0.         -0.         -0.         -0.
 -0.          0.          0.         -0.          0.         -0.
  0.          0.         -0.          0.         -0.        