# Exploring Lasso and Ridge Regression, kNN, and Cross-validation

## 1. Introduction

Welcome to this hands-on lab where we will dive deep into some essential machine learning techniques. By the end of this lab, executed within this Jupyter Notebook, you should have a more tangible grasp of Lasso and Ridge Regression, k-Nearest Neighbors (kNN), and Cross-validation.

## Learning Objectives:


- Understand the importance of regularization in preventing overfitting.
- Differentiate between L1 (Lasso) and L2 (Ridge) regularization.
- Apply Lasso and Ridge regression to a dataset and observe the effect on model coefficients.
- Implement k-fold cross-validation to fine-tune the strength of regularization (the value of alpha).
- Grasp the underlying concept behind instance-based learning and how kNN makes predictions based on data proximity.
- Explore the effect of the hyperparameter 'k' on model performance.
- Delve into the significance of distance metrics in kNN.


As we navigate through the lab, there will be hands-on exercises, reflection points, and visualization segments to reinforce the concepts and allow you to observe the real-world implications of these techniques.

Let's embark on this exciting journey!


## 2. Diabetes Dataset from Scikit-learn

The Diabetes dataset from Scikit-learn is a well-known dataset used in regression problems. It contains medical data from diabetic patients and is often used to predict disease progression based on various health indicators.

- **Number of Instances (samples):** 442
- **Number of Attributes (features):** 10
- **Type of Problem:** Regression (predicts a continuous value, which is a quantitative measure of disease progression one year after baseline)

### Features
The dataset includes the following 10 baseline variables (features), all of which are numerical, mean-centered, and scaled to a common range:

1. **age**: Age of the patient.
2. **sex**: Gender of the patient.
3. **bmi**: Body Mass Index.
4. **bp**: Average blood pressure.
5. **s1**: Total serum cholesterol (tc).
6. **s2**: Low-Density Lipoproteins (LDL cholesterol).
7. **s3**: High-Density Lipoproteins (HDL cholesterol).
8. **s4**: Ratio of total cholesterol to HDL (tch).
9. **s5**: Possibly the log of the serum triglycerides level (ltg).
10. **s6**: Blood sugar level.

### Target Variable
- A quantitative measure of disease progression one year after the baseline.

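The data is already pre-processed: each feature has been mean-centered and scaled by its standard deviation times the square root of the number of samples, so every column has mean 0 and a sum of squares equal to 1 (its standard deviation is therefore about 0.048, not 1). Putting all features on the same scale makes the coefficients comparable and matters for regularized models, whose penalties depend on coefficient size.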
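A quick way to see how the features are scaled is to compute a few column statistics. This is a minimal sketch; the variable names (`X_raw`, `col_means`, and so on) are illustrative and not part of the lab code.

```python
import numpy as np
from sklearn.datasets import load_diabetes

# Load the features as a NumPy array (442 samples x 10 features)
X_raw = load_diabetes().data

col_means = X_raw.mean(axis=0)          # expected to be close to 0 for every column
col_sq_sums = (X_raw ** 2).sum(axis=0)  # expected to be close to 1 for every column
col_stds = X_raw.std(axis=0)            # expected to be close to 1 / sqrt(442)

print("largest |mean| :", np.abs(col_means).max())
print("sum of squares :", col_sq_sums.round(4))
print("std per feature:", col_stds.round(4))
```

If the per-feature standard deviation comes out near 0.048 rather than 1, the columns are scaled to unit sum of squares rather than unit variance, which also explains why the regression coefficients later in the lab are in the hundreds.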

### Loading the Dataset

In [3]:
from sklearn.datasets import load_diabetes
import pandas as pd

# Load the diabetes dataset
diabetes_data = load_diabetes()

# Create a DataFrame from the dataset
diabetes_df = pd.DataFrame(diabetes_data.data, columns=diabetes_data.feature_names)

# Add the target variable (disease progression) to the DataFrame
diabetes_df['Disease_Progression'] = diabetes_data.target

diabetes_df.head()

|   | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | Disease_Progression |
|---|-----|-----|-----|----|----|----|----|----|----|----|---------------------|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.06833 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.02593 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 | 135.0 |


## 3. Lasso and Ridge Regression

Through this section, we aim to:

- Fit a basic linear regression model and observe its potential for overfitting with many features.
- Apply Lasso (L1) and Ridge (L2) regularization and observe their effects on the model's coefficients and performance.

### Task: Simple Linear Regression without Regularization:

1. **Split the Data**:
   - Use `train_test_split` from `sklearn.model_selection` to divide your data into training and testing sets (70% training, 30% testing).

2. **Train the Model**:
   - Train a **Linear Regression** model using the training data.
   - Display the coefficients of the trained model to understand the contribution of each feature.

3. **Evaluate the Model**:
   - Compute and print the model's score (R²) on both the training and testing sets.

In [4]:
# Write your code below

  feature  coefficient
0     age    29.250346
1     sex  -261.707681
2     bmi   546.297373
3      bp   388.400773
4      s1  -901.953387
5      s2   506.761149
6      s3   121.148459
7      s4   288.029325
8      s5   659.271338
9      s6    41.375369
Train score: 0.5244
Test score: 0.4773


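Several coefficients are large in magnitude (for example, s1 ≈ -902), and the training score (0.5244) is higher than the test score (0.4773). Large coefficients can be a sign of overfitting, particularly when features are correlated.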
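One possible way to complete the task above is sketched here. The seed `random_state=42` is an assumption, since the lab does not state which split produced the outputs shown, so your numbers may differ. The names `X`, `y`, `X_train`, `X_test`, `y_train`, and `y_test` match those used by the later cells.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Features as a DataFrame (so column names are available) and the target vector
data = load_diabetes()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# 70% training, 30% testing; random_state=42 is an assumed seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Ordinary least squares, no regularization
linreg = LinearRegression()
linreg.fit(X_train, y_train)

# One coefficient per feature
coeffs = pd.DataFrame({'feature': X.columns, 'coefficient': linreg.coef_})
print(coeffs)

# R² on both splits
print(f"Train score: {linreg.score(X_train, y_train):.4f}")
print(f"Test score: {linreg.score(X_test, y_test):.4f}")
```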

### Lasso (L1) and Ridge (L2) Regularization:

Below, I have implemented Lasso Regression for you. We'll train the model, print the feature coefficients, and evaluate its performance on both training and testing data.

In [5]:
from sklearn.linear_model import Lasso, Ridge

# Lasso Regression
lasso = Lasso(alpha=1.0)
lasso.fit(X_train, y_train)
lasso_coeffs = pd.DataFrame({'feature': X.columns, 'coefficient': lasso.coef_})
print("Lasso Coefficients:")
print(lasso_coeffs)

# Scoring
lasso_train_score = lasso.score(X_train, y_train)
lasso_test_score = lasso.score(X_test, y_test)

print(f"\nLasso Train score: {lasso_train_score:.4f}")
print(f"Lasso Test score: {lasso_test_score:.4f}")

Lasso Coefficients:
  feature  coefficient
0     age     0.000000
1     sex    -0.000000
2     bmi   443.702165
3      bp    51.603401
4      s1     0.000000
5      s2     0.000000
6      s3    -0.000000
7      s4     0.000000
8      s5   201.967127
9      s6     0.000000

Lasso Train score: 0.3562
Lasso Test score: 0.3619


Now it's your turn! Follow the same steps above, but implement **Ridge Regression**. Here's what you need to do:

1. Import the `Ridge` model from `sklearn.linear_model`.
2. Initialize the `Ridge` model with `alpha=1.0`.
3. Fit the Ridge model using the training data (`X_train`, `y_train`).
4. Display the coefficients of the Ridge model.
5. Compute and print the Ridge model's scores for both training and testing sets.


In [6]:
# Write your code here



Ridge Coefficients:
  feature  coefficient
0     age    45.053767
1     sex   -71.947551
2     bmi   280.715875
3      bp   195.213699
4      s1    -2.229433
5      s2   -17.541159
6      s3  -148.688994
7      s4   120.467093
8      s5   198.614859
9      s6   106.934534
Ridge Train score: 0.4283
Ridge Test score: 0.4233


By comparing the coefficients and scores of the basic linear regression model with those of the Lasso and Ridge models, we can observe:

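- Lasso regression can set coefficients exactly to zero, effectively selecting a subset of the features. Here it kept only bmi, bp, and s5 and zeroed the other seven.
- Ridge regression tends to shrink the coefficients but generally doesn't set them to zero.
- Regularization may lead to a model that generalizes better, but only if its strength is chosen well. With `alpha=1.0`, both regularized models score lower on the test set (0.3619 for Lasso, 0.4233 for Ridge) than plain linear regression (0.4773), which suggests this alpha is too strong for this dataset.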
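The difference between the two penalties can be checked by counting how many coefficients each model sets exactly to zero. This is a minimal sketch assuming the same 70/30 split with `random_state=42` (an assumed seed); it also covers the Ridge fit requested in the task above.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42  # assumed seed
)

# Same regularization strength for both models
lasso = Lasso(alpha=1.0).fit(X_train, y_train)
ridge = Ridge(alpha=1.0).fit(X_train, y_train)

# Count coefficients that are exactly zero
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print(f"Lasso: {n_zero_lasso} of {lasso.coef_.size} coefficients are zero")
print(f"Ridge: {n_zero_ridge} of {ridge.coef_.size} coefficients are zero")
print(f"Ridge Train score: {ridge.score(X_train, y_train):.4f}")
print(f"Ridge Test score: {ridge.score(X_test, y_test):.4f}")
```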

## 4. Fine-tuning alpha with k-Fold Cross-validation for L1 and L2 Regularization

Cross-validation (CV) is a robust method for assessing the performance of machine learning models and for hyperparameter tuning. By dividing our data into 'k' folds and training and testing our model k times, we can get a more accurate measure of its performance.
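To make the procedure concrete, the loop below runs 5-fold cross-validation by hand. It is only a sketch: the choice of `Ridge(alpha=0.1)`, the shuffling, and the seed are assumptions made for illustration, and the full dataset is used for brevity.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

X, y = load_diabetes(return_X_y=True)

# Split the row indices into 5 folds; each fold serves once as the validation set
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    model = Ridge(alpha=0.1)                         # illustrative model choice
    model.fit(X[train_idx], y[train_idx])            # train on the other 4 folds
    score = model.score(X[val_idx], y[val_idx])      # R² on the held-out fold
    fold_scores.append(score)
    print(f"Fold {fold}: R² = {score:.4f}")

print(f"Mean R² over {len(fold_scores)} folds: {np.mean(fold_scores):.4f}")
```

Averaging over the folds gives a performance estimate that depends less on any single split than a one-off train/test evaluation does.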

In the context of Lasso (L1) and Ridge (L2) regression, the hyperparameter we're most interested in tuning is α (alpha), which determines the strength of the regularization. A higher value of α increases the regularization strength, penalizing high coefficients more severely.
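The effect of α can be seen by fitting both models over a small range of values. The grid below and the use of the full dataset are assumptions made to keep the sketch short.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge

X, y = load_diabetes(return_X_y=True)

alphas = [0.001, 0.01, 0.1, 1.0, 10.0]  # illustrative grid
ridge_norms = []     # L2 norm of the Ridge coefficient vector per alpha
lasso_nonzero = []   # number of non-zero Lasso coefficients per alpha

for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X, y)
    lasso = Lasso(alpha=alpha, max_iter=100_000).fit(X, y)
    ridge_norms.append(float(np.linalg.norm(ridge.coef_)))
    lasso_nonzero.append(int(np.count_nonzero(lasso.coef_)))
    print(f"alpha={alpha:<6} Ridge ||coef|| = {ridge_norms[-1]:9.2f}   "
          f"Lasso non-zero coefs = {lasso_nonzero[-1]}")
```

As α grows, the Ridge coefficient vector should shrink steadily, while Lasso should drop features one by one until none remain.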

Let's employ GridSearchCV from sklearn, which performs cross-validation and grid search for hyperparameter tuning simultaneously.

**Below**, I have implemented GridSearchCV to find the best alpha for Lasso Regression. The model is trained using various values of alpha and evaluated using a 5-fold cross-validation strategy.

In [8]:
from sklearn.model_selection import GridSearchCV
import numpy as np

# Define a list of alphas to check
alphas = np.logspace(-6, 6, 13)

# Lasso Regression with CV
lasso = Lasso()
parameters = {'alpha': alphas}
lasso_regressor = GridSearchCV(lasso, parameters, scoring='r2', cv=5)
lasso_regressor.fit(X_train, y_train)

print("Lasso Regression: Best α =", lasso_regressor.best_params_['alpha'], "with R^2 score =", lasso_regressor.best_score_)

Lasso Regression: Best α = 0.01 with R^2 score = 0.45258380038989293


### Your Task: Ridge Regression with GridSearchCV

Now, it's your turn to apply the same procedure for **Ridge Regression**. Here's what you need to do:

1. Import the `Ridge` model from `sklearn.linear_model`.
2. Initialize the `Ridge` model.
3. Define a grid of `alpha` values using `np.logspace(-6, 6, 13)`.
4. Use `GridSearchCV` to train and evaluate the Ridge model using 5-fold cross-validation.
5. Print the best `alpha` value and the corresponding R² score.


In [9]:
# Write your code here

Ridge Regression: Best α = 0.001 with R^2 score = 0.45289008081681914
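One way to approach this task mirrors the Lasso cell above. The split seed (`random_state=42`) is an assumption, so the selected α and score may differ from the output shown.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42  # assumed seed
)

# 13 values from 1e-6 to 1e6, one per power of ten
alphas = np.logspace(-6, 6, 13)

# 5-fold cross-validated grid search over alpha, scored by R²
ridge_regressor = GridSearchCV(Ridge(), {'alpha': alphas}, scoring='r2', cv=5)
ridge_regressor.fit(X_train, y_train)

print("Ridge Regression: Best α =", ridge_regressor.best_params_['alpha'],
      "with R^2 score =", ridge_regressor.best_score_)
```

The name `ridge_regressor` is the one the retraining cell below expects.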


With the best α values obtained from the k-fold cross-validation, we can retrain our Lasso and Ridge regression models.

In [10]:
# Retrain using best alphas
optimal_alpha_lasso = lasso_regressor.best_params_['alpha']
optimal_alpha_ridge = ridge_regressor.best_params_['alpha']

lasso_optimal = Lasso(alpha=optimal_alpha_lasso)
ridge_optimal = Ridge(alpha=optimal_alpha_ridge)

lasso_optimal.fit(X_train, y_train)
ridge_optimal.fit(X_train, y_train)

Ridge(alpha=0.001)

In [11]:
lasso_optimal_train_score = lasso_optimal.score(X_train, y_train)
lasso_optimal_test_score = lasso_optimal.score(X_test, y_test)
ridge_optimal_train_score = ridge_optimal.score(X_train, y_train)
ridge_optimal_test_score = ridge_optimal.score(X_test, y_test)

# Print out scores for comparison
print("Optimal Lasso: Train score = {:.4f}, Test score = {:.4f}".format(lasso_optimal_train_score, lasso_optimal_test_score))
print("Optimal Ridge: Train score = {:.4f}, Test score = {:.4f}".format(ridge_optimal_train_score, ridge_optimal_test_score))

Optimal Lasso: Train score = 0.5231, Test score = 0.4787
Optimal Ridge: Train score = 0.5243, Test score = 0.4775


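From the results, we can check whether the retrained models (with optimal α) perform better than or similarly to the original ones. Here both tuned models reach test scores (0.4787 for Lasso, 0.4775 for Ridge) well above those of the `alpha=1.0` models (0.3619 and 0.4233) and on par with unregularized linear regression (0.4773). The selected α values are small, which indicates that only light regularization helps on this dataset.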

## 5. k-Nearest Neighbors (kNN)

The k-Nearest Neighbors (kNN) algorithm is a simple, intuitive, non-parametric method used for classification and regression. In both cases, a prediction is based on the k training examples closest to the query point in the feature space. For classification, the output is a class membership: an object is assigned the class chosen by a majority vote of its neighbors. For regression, the output is the average of the neighbors' target values.
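The regression case can be written out in a few lines of NumPy. The helper `knn_predict_one` is hypothetical and exists only for illustration; it assumes Euclidean distance and uniform weights, which are also the scikit-learn defaults. The split seed is likewise an assumption.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42  # assumed seed
)

def knn_predict_one(x_query, X_ref, y_ref, k=5):
    """Predict one target as the mean of the k nearest reference targets."""
    distances = np.linalg.norm(X_ref - x_query, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                  # indices of the k closest rows
    return y_ref[nearest].mean()

# Compare the hand-written prediction with scikit-learn for one test point
manual_pred = knn_predict_one(X_test[0], X_train, y_train, k=5)
sk_model = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)
sk_pred = sk_model.predict(X_test[:1])[0]

print(f"Manual kNN prediction:       {manual_pred:.2f}")
print(f"KNeighborsRegressor predicts: {sk_pred:.2f}")
```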

### Implementing kNN Regression

First, let's use kNN as a regression tool on the Diabetes dataset, reusing the training and testing sets from the previous sections.

In [12]:
from sklearn.neighbors import KNeighborsRegressor

# Initialize kNN regressor with k=5
knn_regressor = KNeighborsRegressor(n_neighbors=5)

# Train the model
knn_regressor.fit(X_train, y_train)

# Predict and score
knn_train_score = knn_regressor.score(X_train, y_train)
knn_test_score = knn_regressor.score(X_test, y_test)

print(f"kNN Regressor (k=5): Train score = {knn_train_score:.4f}, Test score = {knn_test_score:.4f}")


kNN Regressor (k=5): Train score = 0.5727, Test score = 0.4031


## Student Task: Fine-tuning k

Choosing the right k value is crucial. A small k makes predictions noisy and sensitive to outliers, while a large k smooths the output and can underfit by averaging over neighbors that are far from the query point. Accordingly,

- Fine-tune k using cross-validation.
- Retrain kNN with the optimal k.
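One way to approach this task is to reuse `GridSearchCV` from the previous section. The search range (k from 1 to 30) and the split seed are assumptions made for this sketch, not values prescribed by the lab.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42  # assumed seed
)

# Candidate values of k; the range 1..30 is an illustrative choice
param_grid = {'n_neighbors': list(range(1, 31))}

# 5-fold cross-validated search, scored by R²
knn_search = GridSearchCV(KNeighborsRegressor(), param_grid, scoring='r2', cv=5)
knn_search.fit(X_train, y_train)

best_k = knn_search.best_params_['n_neighbors']

# With the default refit=True, best_estimator_ is already retrained
# on the whole training set using the best k
knn_optimal = knn_search.best_estimator_

print(f"Best k = {best_k} with CV R² = {knn_search.best_score_:.4f}")
print(f"kNN (k={best_k}): Train score = {knn_optimal.score(X_train, y_train):.4f}, "
      f"Test score = {knn_optimal.score(X_test, y_test):.4f}")
```

The same grid could be extended with the `metric` or `weights` parameters of `KNeighborsRegressor` to explore how the distance metric affects performance.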