### Problem 3.1: Optimal Tuning Parameters

In cross-validation, we discussed choosing the tuning parameter values that minimize the cross-validation error. Another approach, called the "one-standard error" rule [ISL pg 214, ESL pg 61], uses the values corresponding to the least complex model whose CV error is within one standard error of the best model's CV error. The goal of this assignment is to compare these two rules.
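As a rough illustration of the rule, the sketch below computes both choices from a matrix of per-fold CV errors. It assumes the `alphas_` and `mse_path_` attributes of scikit-learn's `LassoCV`, and it takes the standard error to be the standard deviation of the fold errors divided by $\sqrt{K}$. The helper name `one_se_alpha` and the data settings are illustrative assumptions, not part of the assignment. Note that scikit-learn calls the penalty `alpha`.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.linear_model import LassoCV

def one_se_alpha(alphas, mse_path):
    """Return (alpha_min, alpha_1se) from per-fold CV errors.

    alphas:   penalty values, shape (n_alphas,)
    mse_path: per-fold MSE, shape (n_alphas, n_folds)
    """
    alphas = np.asarray(alphas, dtype=float)
    mse_path = np.asarray(mse_path, dtype=float)
    n_folds = mse_path.shape[1]
    cv_mean = mse_path.mean(axis=1)
    cv_se = mse_path.std(axis=1, ddof=1) / np.sqrt(n_folds)
    i_min = np.argmin(cv_mean)
    threshold = cv_mean[i_min] + cv_se[i_min]
    # For the lasso, a larger penalty means a less complex model, so take
    # the largest penalty whose mean CV error is within the threshold.
    eligible = alphas[cv_mean <= threshold]
    return alphas[i_min], eligible.max()

# Illustrative settings only (assumed, not prescribed by the assignment)
X, y = make_friedman1(n_samples=200, n_features=10, noise=2.0, random_state=0)
cv_fit = LassoCV(cv=10).fit(X, y)
alpha_min, alpha_1se = one_se_alpha(cv_fit.alphas_, cv_fit.mse_path_)
print(f"alpha_min = {alpha_min:.4f}, alpha_1se = {alpha_1se:.4f}")
```

A model for the one-standard-error rule can then be refit with `Lasso(alpha=alpha_1se)` on the same training data.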

Simulate the data with the `make_friedman1` function from scikit-learn and fit lasso regression models with scikit-learn's `Lasso` class. We will focus on the tuning parameter $\lambda$, the penalty on the coefficient magnitude (scikit-learn names this parameter `alpha`). Generate training data, use k-fold cross-validation to get $\lambda_{\rm min}$ and $\lambda_{\rm 1SE}$, generate test data, make predictions for the test data, and compare the performance of the two rules under squared error loss using a hypothesis test.
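Because both rules are evaluated on the same simulated training and test sets, one reasonable choice for the comparison is a paired test on the per-simulation test errors. The sketch below uses a paired t-test from SciPy. The helper name `compare_rules` is hypothetical, and the input arrays are placeholder numbers that only demonstrate the call; they are not simulation results.

```python
import numpy as np
from scipy import stats

def compare_rules(mse_min, mse_1se):
    """Paired t-test of H0: mean(mse_1se - mse_min) = 0.

    Both inputs hold one test MSE per simulation, in the same order.
    """
    mse_min = np.asarray(mse_min, dtype=float)
    mse_1se = np.asarray(mse_1se, dtype=float)
    diff = mse_1se - mse_min
    result = stats.ttest_rel(mse_1se, mse_min)
    return diff.mean(), result.statistic, result.pvalue

# Placeholder values, NOT real results: replace with the simulation output
rng = np.random.default_rng(321)
fake_mse_min = rng.normal(6.0, 0.5, size=30)
fake_mse_1se = fake_mse_min + rng.normal(0.1, 0.3, size=30)

mean_diff, t_stat, p_value = compare_rules(fake_mse_min, fake_mse_1se)
print(f"mean difference = {mean_diff:.3f}, t = {t_stat:.2f}, p = {p_value:.3f}")
```

If the differences look far from normal, a nonparametric alternative such as `scipy.stats.wilcoxon` on the differences is another option.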

Choose reasonable values for:

- Number of cv folds ($K$)
    - Note: you are free to use repeated CV, repeated hold-outs, or bootstrapping instead of plain cross-validation; just be sure to describe what you did so it will be easier to grade.
- Number of training and test observations
- Number of simulations
- If everyone uses different values, we will be able to see how the results change over the different settings.
- Don't forget to make your results reproducible (e.g., set seed)

This pseudocode will get you started:

In [None]:
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.metrics import mean_squared_error

#-- Settings
n_train = # number of training obs
n_test =  # number of test obs
K =      # number of CV folds
alpha =  # model switch (not scikit-learn's penalty alpha): 1 for lasso, 0 for ridge
M =      # number of simulations

#-- Data Generating Function
def getData(n):
    # random_state=None draws from NumPy's global RNG, so np.random.seed() controls it
    return make_friedman1(n_samples=n, n_features=10, noise=2.0, random_state=None)

# Initialize lists to store performance metrics
mse_list = []

np.random.seed(321)  # Set Seed Here

for m in range(M):

    # 1. Generate Training Data
    X_train, y_train = getData(n_train)

    # 2. Build Training Model using cross-validation
    if alpha == 1:  # Lasso
        model = LassoCV(cv=K).fit(X_train, y_train)
    elif alpha == 0:  # Ridge
        model = RidgeCV(cv=K).fit(X_train, y_train)

    # 3. Get the alpha/lambda that minimizes CV error
    optimal_alpha = model.alpha_

    # 4. Generate Test Data
    X_test, y_test = getData(n_test)

    # 5. Predict y values for test data
    y_pred = model.predict(X_test)

    # 6. Evaluate predictions
    mse = mean_squared_error(y_test, y_pred)
    mse_list.append(mse)

#-- Compare
# Convert to numpy array for easier statistical analysis
mse_array = np.array(mse_list)
# Here, you can perform various statistical tests on mse_array or compare it with other methods.

a. Code for the simulation and performance results.

In [None]:
#your answer here.

b. Description and results of a hypothesis test comparing $\lambda_{\rm min}$ and $\lambda_{\rm 1SE}$.

Your answer here.

### Problem 3.2: Prediction Contest: Real Estate Pricing

This problem uses the [realestate-train](https://github.com/Hyunglok-Kim/EN5422_EV4238/blob/main/realestate-train.csv) and [realestate-test](https://github.com/Hyunglok-Kim/EN5422_EV4238/blob/main/realestate-test.csv) data sets (click on the links for the data).

The goal of this contest is to predict the sale price in thousands (the `price` column) using an *elastic net* model. Evaluation of the test data will be based on the root mean squared error ${\rm RMSE}= \sqrt{\frac{1}{m}\sum_{i=1}^{m} (y_i - \hat{y}_i)^2}$ for the $m$ test set observations.
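The RMSE formula translates directly into NumPy. The function name `rmse` and the three example prices below are illustrative assumptions only.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error over the m observations."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Errors are -10, 10, 0, so the RMSE is sqrt(200 / 3), about 8.16
example = rmse([200.0, 350.0, 410.0], [210.0, 340.0, 410.0])
print(round(example, 2))
```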

a. Use an *elastic net* model to predict the `price` of the test data. Submit a .csv file (ensure comma-separated format) named `lastname_firstname.csv` that includes a column named `yhat` containing your predictions. We will use automated evaluation, so the format must be exact.

    - You are free to use any tuning parameters

    - You are free to use any data transformation or feature engineering
    
    - You will receive credit for a proper submission; the top five scores will receive 2 bonus points.
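One way to produce a file in this shape is sketched below with pandas. It assumes that a single `yhat` column with a header row and no index column is acceptable; the helper name `write_submission`, the placeholder predictions, and the temporary output directory are all illustrative.

```python
import os
import tempfile

import numpy as np
import pandas as pd

def write_submission(yhat, path):
    """Write predictions to a comma-separated file with one column, 'yhat'."""
    frame = pd.DataFrame({"yhat": np.asarray(yhat, dtype=float)})
    frame.to_csv(path, index=False)  # index=False avoids an extra unnamed column

# Placeholder predictions and a temporary directory, only to show the call
out_path = os.path.join(tempfile.mkdtemp(), "lastname_firstname.csv")
write_submission([251.3, 198.7, 402.1], out_path)

# Read the file back to confirm the layout before submitting
check = pd.read_csv(out_path)
print(check.columns.tolist(), len(check))
```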

b. Show the code you used to transform the data. Note: there are some categorical predictors so at the least you will have to convert those to something numeric (e.g., one-hot or dummy coding).
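As a hedged starting point for the categorical predictors, the sketch below one-hot encodes a toy data frame with `pandas.get_dummies` and then aligns the test columns to the training columns, so that a level seen in only one of the two sets does not change the design matrix. The column names `sqft` and `style` and all values are made up for illustration and are not taken from the contest data.

```python
import pandas as pd

# Toy stand-ins for the training and test sets (hypothetical columns)
train = pd.DataFrame({"sqft": [1500, 2100, 1800],
                      "style": ["ranch", "colonial", "ranch"]})
test = pd.DataFrame({"sqft": [1700, 2500],
                     "style": ["colonial", "split"]})

# One-hot encode the categorical column in each set
X_train = pd.get_dummies(train, columns=["style"], dtype=float)
X_test = pd.get_dummies(test, columns=["style"], dtype=float)

# Align test to the training columns: levels unseen in training are dropped,
# and training levels missing from the test set are filled with zeros
X_test = X_test.reindex(columns=X_train.columns, fill_value=0.0)

print(X_train.columns.tolist())
print(X_test)
```

Whatever encoding is used, the same fitted transformation should be applied to both the training and the test data.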

In [None]:
#your answer here.

c. Report the $\alpha$ and $\lambda$ parameters you used to make your final predictions. Describe how you chose those tuning parameters and show supporting code.
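If $\alpha$ and $\lambda$ follow the glmnet convention (mixing parameter and penalty strength), then in scikit-learn they correspond to `l1_ratio` and `alpha`, respectively. The sketch below tunes both with `ElasticNetCV` on stand-in data from `make_regression`. The candidate grid, the number of folds, and the decision to standardize the predictors are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data; replace with the transformed real estate predictors and price
X, y = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=0)

# Standardize, then search over mixing values and an automatic penalty path
pipe = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5),
)
pipe.fit(X, y)

enet = pipe.named_steps["elasticnetcv"]
alpha_mix = enet.l1_ratio_  # glmnet-style alpha (mixing parameter)
lam = enet.alpha_           # glmnet-style lambda (penalty strength)
print(f"alpha (l1_ratio) = {alpha_mix}, lambda (alpha) = {lam:.4f}")
```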

Your answer here.

d. Report the anticipated performance of your method in terms of RMSE. We will see how closely your performance assessment matches the actual value.
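One possible way to anticipate the test RMSE is nested cross-validation: the tuning is repeated inside each outer training fold, so the outer folds are never used to choose the parameters. The sketch below runs on stand-in data; the grid, the fold counts, and the seed are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data; replace with the transformed real estate training set
X, y = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=0)

# Inner CV (inside ElasticNetCV) tunes the parameters on each outer training fold
pipe = make_pipeline(StandardScaler(), ElasticNetCV(l1_ratio=[0.5, 1.0], cv=5))

# Outer CV scores the whole tuning procedure on held-out folds
outer = KFold(n_splits=5, shuffle=True, random_state=321)
scores = cross_val_score(pipe, X, y, cv=outer,
                         scoring="neg_root_mean_squared_error")
rmse_folds = -scores  # the scorer returns negated RMSE

print(f"estimated RMSE: {rmse_folds.mean():.2f} "
      f"(sd across folds: {rmse_folds.std(ddof=1):.2f})")
```

The minimum of the inner CV error curve tends to be optimistic because the same folds were used to pick the parameters, which is why the outer loop is added here.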

In [None]:
#your answer here.