# **Exercise Sheet 3:** Feature Selection and Regularization

# Part A: Foundations & Concepts

Before diving into coding and implementing feature selection and regularization techniques, it's important to understand the fundamental concepts and motivations behind these methods.

Take a moment to reflect on these concepts yourself before seeking additional help from ChatGPT 😉 You're also encouraged to discuss these ideas with your classmates.

## 1. General Concepts

### a) Why do we need feature selection or regularization when fitting a linear regression model with a large number of features?

#### Your Answer:

#### Solution:
- Overfitting: the model may fit the training data too closely, capturing noise rather than the underlying pattern. It then generalizes poorly to new data.
- Computational cost: more features increase computational complexity, making the model slower to train and evaluate.
- Interpretability: a model with too many features can be difficult to interpret.
- Numerical issues: when the number of features approaches or exceeds the number of samples, the coefficients can no longer be estimated reliably (or uniquely).
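The numerical issue in the last point can be made concrete with a small sketch. The sizes below (20 samples, 50 features) are made up for illustration. Ordinary least squares needs to invert `X.T @ X`, a 50 x 50 matrix here, but its rank equals the rank of `X`, which cannot exceed the number of samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes with more features (p) than samples (n)
n_samples, n_features = 20, 50
X = rng.normal(size=(n_samples, n_features))

# rank(X.T @ X) equals rank(X), so it is enough to look at X itself
rank = np.linalg.matrix_rank(X)

print(f"X.T @ X has shape {(n_features, n_features)}")
print(f"but its rank is only {rank}, so it cannot be inverted")
```

Because the rank is smaller than the number of features, infinitely many coefficient vectors fit the training data equally well, and the least-squares solution is not unique.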

### b) What does feature selection mean and how does it differ from regularization?

#### Your Answer:

#### Solution:
- Feature selection: Drop some features from the model based on some criterion (e.g. correlation with the target variable or a statistical test). The model is trained only on the selected features.
- Regularization: All features are still included in the model, but a penalty on the size of the coefficients reduces their impact.

### c) Can we be confident that we select the correct variables in feature selection?

#### Your Answer:

#### Solution:
- It depends, but in general it is difficult.
- One issue is that with a lot of predictors, some of them may relate to the target purely by chance, and we may select them wrongly.
- Another issue is that with highly correlated predictors, we may select one of them but not the other, even if both are important for the outcome.
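The chance-correlation problem can be illustrated with a short simulation. This is only a sketch: the sizes (50 samples, 1000 predictors) and the threshold of 0.3 are arbitrary choices, and both the predictors and the target are pure noise, so none of them is truly related to the outcome.

```python
import numpy as np

rng = np.random.default_rng(42)

# Pure-noise data: y is generated independently of every column of X
n_samples, n_features = 50, 1000
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)

# Pearson correlation of each predictor with the target
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
y_std = (y - y.mean()) / y.std()
corr = X_std.T @ y_std / n_samples

max_abs_corr = np.abs(corr).max()
n_strong = int((np.abs(corr) > 0.3).sum())

print(f"largest absolute correlation: {max_abs_corr:.2f}")
print(f"predictors with |r| > 0.3:    {n_strong}")
```

A selection rule that keeps the predictors most correlated with the target would pick some of these noise columns, even though none of them carries any signal.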

# Part B: Coding & Visualization

Now let's apply our knowledge of feature selection and regularization! We start with some imports needed.

*Hint:* The functions imported from the Helper file may help you with certain tasks, but you are not required to use them. You can also write your own code to achieve the same results.

## Notes:
- We could also enumerate the subpoints of each task instead of using bullet points, which would make them easier to read.

```python
# imports
import numpy as np
import pandas as pd
```


## 2. Linear Regression with high-dimensional data

- Load and inspect the dataset `data/todo.csv`
- What types of predictors and outcomes are present in the dataset?
- Plot pairplots of the first 5 variables

- Split the dataset into training and test set (70%/30% split)

- Perform a linear regression, first using all features.

- Now perform a linear regression using only `insert` features. Compare the two models using the linear regression table (TODO: figure out what that is)

- Calculate MSE and R2 for both models on the training data. What can you conclude?

- Now predict and calculate MSE and R2 for both models on the test data. Is there a difference? What can you conclude?

- Show a plot of the fitted regression lines for the test data. How is the overfitting of model 2 visible? What features? Using PCA? 
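The split, fit, and evaluation steps above could be sketched as follows. Since the dataset is not fixed yet, the sketch uses synthetic data as a stand-in: the sizes, the coefficients, and the choice of the first three columns as the "small" model are all assumptions for illustration. It uses scikit-learn directly rather than the helper functions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the exercise dataset: 100 samples, 60 features,
# of which only the first 3 truly influence the outcome.
n_samples, n_features = 100, 60
X = rng.normal(size=(n_samples, n_features))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(size=n_samples)

# 70%/30% train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)


def evaluate(columns):
    """Fit OLS on the given columns; return {split: (MSE, R2)}."""
    model = LinearRegression().fit(X_train[:, columns], y_train)
    scores = {}
    for name, X_part, y_part in [
        ("train", X_train, y_train),
        ("test", X_test, y_test),
    ]:
        pred = model.predict(X_part[:, columns])
        scores[name] = (mean_squared_error(y_part, pred), r2_score(y_part, pred))
    return scores


all_scores = evaluate(list(range(n_features)))  # model with all features
small_scores = evaluate([0, 1, 2])              # model with 3 features

for label, scores in [("all 60 features", all_scores), ("3 features", small_scores)]:
    for split, (mse, r2) in scores.items():
        print(f"{label:>16} | {split:<5} | MSE = {mse:.2f} | R2 = {r2:.2f}")
```

On the training data the larger model always has the lower MSE, because the smaller model is nested inside it. The interesting comparison is therefore the one on the test data.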

## 3. PCA Regression

- Load the dataset `data/todo.csv`
- What types of predictors and outcomes are present in the dataset?
- Plot pairplots of the first 5 variables

- Note:
Use a dataset with numerical features that results in a singularity error when a linear regression with all predictors is fitted, i.e. when p >> n.

- Perform a linear regression using all the features and print the regression table. What do you observe?

Solution: All the coefficients and p-values should be NaNs. See here for an example: https://bookdown.org/staedler_n/highdimstats/multiple-linear-regression.html#overfitting

The issue is that p >> n, so the coefficients cannot be computed.

- Perform PCA on the dataset. Is it important to scale the data before PCA? 

*Hint:* Check whether the units of the features differ.

- Plot the cumulative explained variance (or the individual explained variance). How many principal components are needed to explain 80% of the variance?

- Now let's compute the linear regression using enough principal components to explain 80% of the variance.

*Hint:* if you did not solve the previous exercise, continue with 20 principal components.

- Print the regression table. Are the coefficients interpretable?
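One possible way to go through these PCA steps is sketched below. The data are synthetic (40 samples, 200 features generated from 5 hidden factors), which is only an assumption standing in for the real dataset, and the sketch uses scikit-learn directly rather than the helper functions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic p >> n data: 40 samples, 200 features driven by 5 latent factors
n_samples, n_features, n_latent = 40, 200, 5
latent = rng.normal(size=(n_samples, n_latent))
loadings = rng.normal(size=(n_latent, n_features))
X = latent @ loadings + 0.5 * rng.normal(size=(n_samples, n_features))
y = latent[:, 0] - 2.0 * latent[:, 1] + 0.1 * rng.normal(size=n_samples)

# Scale first so that features with large units do not dominate the PCA
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA and find the smallest number of components reaching 80% variance
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cumulative, 0.80) + 1)
print(f"components needed for 80% of the variance: {n_components}")

# Regress the outcome on the leading principal component scores
scores = pca.transform(X_scaled)[:, :n_components]
model = LinearRegression().fit(scores, y)
print("coefficients on the principal components:", np.round(model.coef_, 3))
```

The array `cumulative` is what would go on the y-axis of the cumulative explained variance plot. Note that each coefficient refers to a principal component, which is a weighted mix of all original features, so it does not describe the effect of any single feature.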

## 4. Ridge and Lasso Regression

- Load the dataset `data/todo.csv`
- What types of predictors and outcomes are present in the dataset?
- Plot pairplots of the first 5 variables

- Note: maybe reuse the dataset

- Split dataset into training and test set (70%/30% split)

- Fit Ridge regression with different alphas (use the function) using training data

- Show coefficient shrinkage for different alphas. What do you observe? Do you expect coefficients to be zero for some alphas?

- Fit Lasso regression with different alphas (use the function) using training data

- Show the coefficients for each alpha. How do they change with increasing alpha? Are features dropped?

- Compare all models (linear regression on features, linear regression on PCA, Ridge and Lasso) using MSE and R2 on the test data. Which model performs best?
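A minimal sketch of the Ridge and Lasso coefficient comparison is given below. It uses scikit-learn's `Ridge` and `Lasso` directly instead of the helper function, and the synthetic data and the alpha grid are assumptions chosen for illustration. For brevity it fits on the full synthetic sample; in the exercise the models should be fitted on the training set only.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 10 standardized features, only the first two matter
n_samples, n_features = 80, 10
X = StandardScaler().fit_transform(rng.normal(size=(n_samples, n_features)))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n_samples)

alphas = [0.01, 0.1, 1.0, 10.0]  # hypothetical grid of penalty strengths

# One row of coefficients per alpha
ridge_coefs = np.array([Ridge(alpha=a).fit(X, y).coef_ for a in alphas])
lasso_coefs = np.array([Lasso(alpha=a).fit(X, y).coef_ for a in alphas])

for a, r, l in zip(alphas, ridge_coefs, lasso_coefs):
    print(
        f"alpha={a:>5}: ridge norm={np.linalg.norm(r):.2f}, "
        f"ridge zeros={int(np.sum(r == 0))}, lasso zeros={int(np.sum(l == 0))}"
    )
```

In this sketch the Ridge coefficients shrink as alpha grows but do not become exactly zero, whereas Lasso sets more and more coefficients to exactly zero until no feature is left in the model.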

## 5. Cross-Validation

Does it make sense to put this here? Maybe to reinforce the concepts?

- What is the difference between train-test split and cross-validation?

#### Your Answer:


#### Solution:
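In k-fold cross-validation the data is split into k folds. Each fold serves as the test set exactly once, while the remaining folds are used for training, so every sample is used for both training and testing. It is usually used when not a lot of data is available. A single train-test split is used when you have a lot of data and can afford to set some of it aside purely as a test set.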
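How the folds are formed can be seen in a small sketch with ten samples and five folds, using scikit-learn's `KFold`. The sample count and the random seed are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 10
X_placeholder = np.zeros((n_samples, 1))  # only the number of rows matters here

test_counts = np.zeros(n_samples, dtype=int)
train_counts = np.zeros(n_samples, dtype=int)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kfold.split(X_placeholder), start=1):
    print(f"fold {fold}: test samples = {test_idx.tolist()}")
    test_counts[test_idx] += 1
    train_counts[train_idx] += 1

print("times each sample was in a test fold: ", test_counts.tolist())
print("times each sample was in a training set:", train_counts.tolist())
```

With five folds, every sample appears in a test fold once and in a training set four times, whereas a single train-test split assigns each sample to one role only.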

- Perform a 5-fold cross-validation with the Lasso model from the last exercise. Report the MSE for all folds. Do they differ from the MSE on the test set from the previous exercise?

- Finally, we introduce a new concept that is common in clinical data science: "Leave-One-Out Cross-Validation" (LOOCV). In this method, we leave one sample out for testing and use the rest for training. This is repeated for each sample in the dataset. Implement this and compare the results to the 5-fold cross-validation. What do you observe?


*Hint:* Give them a hint on how to implement this
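One way the two cross-validation schemes could be implemented is with scikit-learn's `cross_val_score` together with `KFold` and `LeaveOneOut`. The sketch below is not the intended solution: the data are synthetic, and `alpha=0.1` is a placeholder for whatever value was used in the Lasso exercise.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in data: 60 samples, 10 features, two of them informative
n_samples, n_features = 60, 10
X = rng.normal(size=(n_samples, n_features))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n_samples)

model = Lasso(alpha=0.1)  # placeholder alpha, not a tuned value

# scikit-learn reports negative MSE, so flip the sign to get the MSE
kfold_mse = -cross_val_score(
    model, X, y,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",
)
loo_mse = -cross_val_score(
    model, X, y, cv=LeaveOneOut(), scoring="neg_mean_squared_error"
)

print("5-fold MSE per fold:", np.round(kfold_mse, 2))
print(f"5-fold mean MSE: {kfold_mse.mean():.2f}")
print(f"LOOCV mean MSE:  {loo_mse.mean():.2f} (from {len(loo_mse)} single-sample folds)")
```

MSE is used as the score because each LOOCV test fold contains a single sample, for which R2 is not well defined. Each LOOCV fold yields the squared error of one prediction, so the individual values vary a lot and are usually summarized by their mean.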