# Part 1: Diabetes

In this part of the assignment, you will build a predictive model for diabetes disease progression in the next year based on current observed features of disease symptoms. 

**Learning objectives.** You will:
1. Train and test a linear model using ordinary least squares regression. 
2. Apply regularization, specifically LASSO, to build a sparse model.

The following code will load the data and preview the first three examples. The ten features are as follows (in order):

- `age`: age in years
- `sex`
- `bmi`: body mass index
- `bp`: average blood pressure
- `s1`: tc, total serum cholesterol
- `s2`: ldl, low-density lipoproteins
- `s3`: hdl, high-density lipoproteins
- `s4`: tch, total cholesterol / HDL
- `s5`: ltg, log of serum triglycerides level
- `s6`: glu, blood sugar level

The target value is a quantitative measure of disease progression after 1 year, where larger numbers are worse.

The code stores the feature matrix `X` as a two-dimensional NumPy array where each row corresponds to a data point and each column is a feature. The target value is stored as a one-dimensional NumPy array `y` where the element at index `i` of `y` corresponds to the data point in row `i` of `X`.

Your overall goal in this part is to build and evaluate a linear model to predict the target variable `y` as a function of the ten features in `X`, and to identify which features are more significant for predicting `y`.

In [4]:
# Run but DO NOT MODIFY this code

from sklearn.datasets import load_diabetes

# Load the diabetes dataset
diabetes = load_diabetes(scaled = False)
print(diabetes.feature_names)

# Get the feature data and target variable
X = diabetes.data
y = diabetes.target

# Preview the first 3 data points
print(X[:3])
print(y[:3])

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
[[ 59.       2.      32.1    101.     157.      93.2     38.       4.
    4.8598  87.    ]
 [ 48.       1.      21.6     87.     183.     103.2     70.       3.
    3.8918  69.    ]
 [ 72.       2.      30.5     93.     156.      93.6     41.       4.
    4.6728  85.    ]]
[151.  75. 141.]


## Task 1

Randomly split the input data into a [train and test partition](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html), with 30% of the data reserved for testing. Use a random seed of `2024` for reproducibility of the results.

In [5]:
# Write task 1 code here
import numpy as np
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 2024)
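As a quick sanity check on a split like the one above, the sketch below uses hypothetical stand-in arrays (`X_demo`, `y_demo`) with 442 rows, the size of the diabetes dataset. It shows the resulting partition sizes, that the same seed reproduces the same split, and that rows stay aligned with their targets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data (an assumption for illustration): row i of X_demo is [2i, 2i + 1]
X_demo = np.arange(442 * 2).reshape(442, 2)
y_demo = np.arange(442)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2024
)
# Splitting again with the same seed should give the same partition
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2024
)

print(X_tr.shape, X_te.shape)               # (309, 2) (133, 2)
print(np.array_equal(X_te, X_te2))          # True: the seed makes the split reproducible
print(np.array_equal(X_te[:, 0] // 2, y_te))  # True: each row keeps its own target
```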


## Task 2

Build a baseline prediction by computing the [average](https://numpy.org/doc/stable/reference/generated/numpy.mean.html) target value of the training data. Evaluate the [root mean squared error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.root_mean_squared_error.html#sklearn.metrics.root_mean_squared_error) between the baseline and the test data.

In [6]:
from sklearn.metrics import mean_squared_error
import numpy as np

baseline_prediction = np.mean(y_train)
y_baseline = np.full(y_test.shape, baseline_prediction)  # constant prediction: the training mean repeated for every test point

rmse_baseline = np.sqrt(mean_squared_error(y_test, y_baseline))

print(f"Baseline RMSE: {rmse_baseline}")

Baseline RMSE: 78.17581726028506
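The task links to `root_mean_squared_error`, which is only available in scikit-learn 1.4 and later, while the cell above takes the square root of `mean_squared_error`. The sketch below shows one way to support both, checked on small hypothetical arrays (`y_train_demo`, `y_test_demo`) where the answer can be worked out by hand.

```python
import numpy as np

try:
    # Available in scikit-learn >= 1.4
    from sklearn.metrics import root_mean_squared_error
except ImportError:
    # Fallback for older versions: square root of the mean squared error
    from sklearn.metrics import mean_squared_error

    def root_mean_squared_error(y_true, y_pred):
        return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Hypothetical toy targets, not the diabetes data
y_train_demo = np.array([1.0, 2.0, 3.0])
y_test_demo = np.array([2.0, 4.0])

# Baseline: predict the training mean (2.0) for every test point
baseline = y_train_demo.mean()
y_pred_demo = np.full_like(y_test_demo, baseline)

# Errors are 0 and 2, so RMSE = sqrt((0 + 4) / 2) = sqrt(2)
rmse_demo = root_mean_squared_error(y_test_demo, y_pred_demo)
print(rmse_demo)
```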


## Task 3

Build a linear predictive model using [ordinary least squares regression](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares) fit on the training data. 

Evaluate the [root mean squared error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.root_mean_squared_error.html#sklearn.metrics.root_mean_squared_error) of the model on **both** the training data **and** the test data (that is, the training error and the generalization error). Report both and briefly discuss the results: Do you observe underfitting or overfitting?

Note that the model predictions on the test data may not be perfect, but they should improve meaningfully over the simple baseline from Task 2 or something is wrong.

In [7]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error as mse

reg = LinearRegression()
reg.fit(X_train, y_train)

y_train_pred = reg.predict(X_train)  
y_test_pred = reg.predict(X_test)    


train_rmse = np.sqrt(mse(y_train, y_train_pred))
test_rmse = np.sqrt(mse(y_test, y_test_pred))

print(f"Training RMSE: {train_rmse}")
print(f"Test RMSE: {test_rmse}")


Training RMSE: 52.85235480118789
Test RMSE: 55.61674711723449


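I observe mild overfitting at most. The training RMSE (52.85) is lower than the test RMSE (55.62), so the model fits the training set slightly better than unseen data, but the gap is only about 5%. Both errors are well below the baseline RMSE of 78.18 from Task 2, so the model is learning real structure rather than badly underfitting.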
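To put the reported errors in context, the relative gap between training and test RMSE and the improvement over the Task 2 baseline can be computed directly from the printed values. This is plain arithmetic on the numbers shown above, not a new experiment.

```python
# Values copied from the printed outputs of Tasks 2 and 3
baseline_rmse = 78.17581726028506
train_rmse = 52.85235480118789
test_rmse = 55.61674711723449

# How much worse the model does on unseen data than on the training data
gap = test_rmse / train_rmse - 1
# How much the model improves on the constant baseline, measured on the test set
improvement = 1 - test_rmse / baseline_rmse

print(f"Test RMSE exceeds training RMSE by {gap:.1%}")         # 5.2%
print(f"Test RMSE improves on the baseline by {improvement:.1%}")  # 28.9%
```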



## Task 4

If your goal is to understand which of the input features in `X` are most important for predicting the target `y`, the linear model you built in Task 3 may not be very helpful. Build a new linear model using [Lasso regression](https://scikit-learn.org/stable/modules/linear_model.html#lasso) that achieves generalization error comparable to that of the Task 3 model using ordinary least squares regression (within 10% of the root mean squared error on the test set), but with **0 for at least three of the model coefficients** (that is, the model does not use these features to make predictions).

You may need to try multiple values of the `alpha` *hyperparameter* to find a model that satisfies both the error and *sparsity* constraints (that at least three of the coefficients are 0). Nevertheless, you should only evaluate error on the test dataset **once**. Show your work for how you find a good `alpha` in code and explain your work in English below. Standard approaches would be to split the training data into a train and validation set, or to use [cross validation](https://scikit-learn.org/stable/modules/cross_validation.html) on the training data.

For your final fit Lasso model with the chosen `alpha`, report the [root mean squared error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.root_mean_squared_error.html#sklearn.metrics.root_mean_squared_error) on the test data. Also report the model coefficients and use this to explain which features (see their names/interpretations above) seem less important for predicting the target.

In [13]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

alphas = np.logspace(-4, 1, 100)
lasso = Lasso(max_iter=10000) 
param_grid = {'alpha': alphas}

grid_search = GridSearchCV(lasso, param_grid, scoring='neg_root_mean_squared_error', cv=5)
grid_search.fit(X_train, y_train)

print(f"Best alpha without worrying about sparsity: {grid_search.best_params_['alpha']}")

final_alpha = None
final_coef = None

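# cv_results_['params'] keeps the order of `alphas` (ascending), so this loop
# stops at the smallest alpha whose fit has at least three zero coefficients
for alpha in grid_search.cv_results_['params']:
    lasso.set_params(alpha=alpha['alpha'])
    lasso.fit(X_train, y_train)
    
    if np.sum(lasso.coef_ == 0) >= 3: 
        final_alpha = alpha['alpha']
        final_coef = lasso.coef_
        break  # Stop as soon as we find a valid model

# Compare against None explicitly so the check does not depend on truthiness
if final_alpha is not None: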
    lasso.set_params(alpha=final_alpha)
    lasso.fit(X_train, y_train)
    
    y_pred = lasso.predict(X_test)  # Predict using the test set
    test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))

    print(f"Best alpha with error and sparsity: {final_alpha}")
    print(f"Test RMSE for Lasso regression with best alpha: {test_rmse}")
    print("Model coefficients:")
    print(final_coef)



Best alpha without worrying about sparsity: 0.1519911082952933
Best alpha with error and sparsity: 3.944206059437656
Test RMSE for Lasso regression with best alpha: 57.5360272634609
Model coefficients:
[ 0.         -3.4484256   5.97353445  1.27211544  1.11532614 -1.31490185
 -1.93769332  0.          0.          0.1748216 ]
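Pairing each printed coefficient with its feature name makes the zero entries easier to read, and the 10% error constraint can be checked against the Task 3 result in the same step. The values below are copied from the outputs above; nothing is refit.

```python
# Feature order and coefficients copied from the printed outputs
feature_names = ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
coef = [0.0, -3.4484256, 5.97353445, 1.27211544, 1.11532614,
        -1.31490185, -1.93769332, 0.0, 0.0, 0.1748216]

# Features the Lasso model does not use at all
zeroed = [name for name, c in zip(feature_names, coef) if c == 0]
print(zeroed)  # ['age', 's4', 's5']

# Error constraint: Lasso test RMSE within 10% of the OLS test RMSE
ols_test_rmse = 55.61674711723449
lasso_test_rmse = 57.5360272634609
within_10_percent = lasso_test_rmse <= 1.10 * ols_test_rmse
print(within_10_percent)  # True
```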


The code uses Lasso regression and searches for a value of the regularization parameter `alpha` that balances prediction error and sparsity. `GridSearchCV` tests 100 `alpha` values with 5-fold cross validation on the training data and selects `alpha` ≈ 0.152 when sparsity is ignored. I then loop through the `alpha` values in increasing order and select the first one whose model has at least three zero coefficients, which gives `alpha` ≈ 3.944. The model with this `alpha` is fit to the training data and evaluated on the test set only once. Its test RMSE is 57.54, which is within 10% of the ordinary least squares test RMSE of 55.62 from Task 3. The coefficients for age, s4 (tch, total cholesterol / HDL), and s5 (ltg, log of serum triglycerides level) are zero, so the model does not use these features, which suggests they are less important for predicting the target. Because the features are unscaled, the penalty falls more heavily on features with small numeric ranges, so coefficient size here is not a scale-free measure of importance.