# Part 2: Diabetes

In this part of the assignment, you will build a predictive model for diabetes disease progression in the next year based on current observed features of disease symptoms. 

**Learning objectives.** You will:
1. Train and test a linear model using ordinary least squares regression.
2. Use numerical Python (NumPy) and the standard `sklearn` API in Python.
3. Train and test a polynomial model, comparing to the linear model and demonstrating overfitting.
4. Apply regularization, specifically LASSO, to build a sparse linear model.

The following code will load and preview three examples of the data. The ten features are as follows (in order):

- `age`: age in years
- `sex`
- `bmi`: body mass index
- `bp`: average blood pressure
- `s1`: tc, total serum cholesterol
- `s2`: ldl, low-density lipoproteins
- `s3`: hdl, high-density lipoproteins
- `s4`: tch, total cholesterol / HDL
- `s5`: ltg, log of serum triglycerides level
- `s6`: glu, blood sugar level

The target value is a quantitative measure of disease progression after 1 year, where larger numbers are worse.

The code stores the feature matrix `X` as a two-dimensional NumPy array where each row corresponds to a data point and each column is a feature. The target value is stored as a one-dimensional NumPy array `y` where the element of `y` at index `i` corresponds to the data point in row `i` of `X`.

Your overall goal in this part is to build and evaluate a linear model to predict the target variable `y` as a function of the ten features in `X`, and to identify which features are more significant for predicting `y`.

In [3]:
# Run but DO NOT MODIFY this code

from sklearn.datasets import load_diabetes

# Load the diabetes dataset
diabetes = load_diabetes(scaled = False)
print(diabetes.feature_names)

# Get the feature data and target variable
X = diabetes.data
y = diabetes.target

# Preview the first 3 data points
print(X[:3])
print(y[:3])

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
[[ 59.       2.      32.1    101.     157.      93.2     38.       4.
    4.8598  87.    ]
 [ 48.       1.      21.6     87.     183.     103.2     70.       3.
    3.8918  69.    ]
 [ 72.       2.      30.5     93.     156.      93.6     41.       4.
    4.6728  85.    ]]
[151.  75. 141.]


## Task 1

Use `sklearn` to randomly split the input data into a [train and test partition](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html), with 30% of the data reserved for testing. Use a random seed of `2025` for reproducibility of the results.

Print the number of data points in the resulting train and test partitions.
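To see how the split behaves before applying it to the real data, here is a small sketch on made-up arrays (`X_toy` and `y_toy` are assumptions for illustration, and the 25% test fraction is chosen only to give round numbers). It shows that rows of the features and targets stay aligned and that a fixed `random_state` reproduces the same partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: row i of X_toy is [2i, 2i + 1] and its target is i
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=2025
)
print(len(X_tr), len(X_te))

# Repeating the call with the same seed gives the same test rows
_, X_te_again, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=2025
)
print(np.array_equal(X_te, X_te_again))
```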

In [5]:
from sklearn.model_selection import train_test_split

# Write task 1 code here
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 2025)
print("Number of data points in the resulting train partition is", X_train.shape[0])
print("Number of data points in the resulting test partition is", X_test.shape[0])

Number of data points in the resulting train partition is 309
Number of data points in the resulting test partition is 133


## Task 2

Build a baseline prediction by computing the [average](https://numpy.org/doc/stable/reference/generated/numpy.mean.html) target value of the training data and predicting this average for every test data point.

For example, if the training data target values were `[2, 2, 5]` then you would compute the average as `3`. If there were only two test data points, then your predictions would be simply `[3, 3]`.

Evaluate the [root mean squared error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.root_mean_squared_error.html#sklearn.metrics.root_mean_squared_error) between the baseline and the test data.

This RMSE will serve as a benchmark - any useful model should achieve a lower error than this simple baseline.
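The toy example above can be worked through in a few lines. This is a sketch only: the test targets `[1, 5]` are made up for illustration, and the RMSE is written out with NumPy to make the formula visible rather than calling the `sklearn` metric.

```python
import numpy as np

# Training targets from the example in the text
y_train_toy = np.array([2.0, 2.0, 5.0])
# Hypothetical test targets, chosen only for illustration
y_test_toy = np.array([1.0, 5.0])

# The baseline predicts the training mean for every test point
baseline_pred = np.full(y_test_toy.shape, y_train_toy.mean())

# RMSE = sqrt(mean((y - y_hat)^2))
rmse = np.sqrt(np.mean((y_test_toy - baseline_pred) ** 2))
print(baseline_pred, rmse)
```

Each toy test point is off by 2 from the predicted mean of 3, so the RMSE is 2.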

In [7]:
import numpy as np
from sklearn.metrics import root_mean_squared_error

# Write task 2 code here
y_train_mean = np.mean(y_train)
y_baseline_prediction = np.full_like(y_test, y_train_mean)
RMSE_baseline = root_mean_squared_error(y_test, y_baseline_prediction)
print("The RMSE between baseline predictions and y_test is", RMSE_baseline)

The RMSE between baseline predictions and y_test is 75.8165287907097


## Task 3

Use [`sklearn`](https://scikit-learn.org/stable/) to fit a linear predictive model on the training data using [ordinary least squares regression](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares). 

Evaluate the [root mean squared error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.root_mean_squared_error.html#sklearn.metrics.root_mean_squared_error) of the model on **both** the training data **and** the test data (that is, the training error and the generalization error). Report both results.

Note that the model predictions on the test data may not be perfect, but they should improve meaningfully over the simple baseline from Task 2 or something is wrong.

In [9]:
from sklearn.linear_model import LinearRegression

# Write task 3 code here
model = LinearRegression()
model.fit(X_train, y_train)
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
RMSE_train = root_mean_squared_error(y_train, y_train_pred)
RMSE_test = root_mean_squared_error(y_test, y_test_pred)
print("The RMSE of the model on the training data is", RMSE_train)
print("The RMSE of the model on the test data is", RMSE_test)

The RMSE of the model on the training data is 52.97771955290354
The RMSE of the model on the test data is 55.14187488833094


## Task 4

To understand which input features are most important for predicting diabetes progression, we need a model that can automatically select relevant features. The ordinary least squares model from Task 3 uses all features, making it harder to identify which ones truly matter.

Build a new linear model using [Lasso regression](https://scikit-learn.org/stable/modules/linear_model.html#lasso) that meets two criteria:
  - Performance: At most 10% greater error than the linear model with all the features in Task 3.
  - Sparsity: At least three model coefficients set to 0 (meaning the model does not use these features to make predictions). You can treat any coefficient whose absolute value is less than 0.0001 as effectively 0 for this task.

You may need to try multiple values of the `alpha` *hyperparameter* to find one that satisfies both constraints. For example, you can try [0.1, 1, 5, 10, ...]. The final LASSO model is only required to satisfy the above two criteria. Nevertheless, you should evaluate error on the test dataset only **once**, after you have finished searching for such a value of `alpha`. To compare candidate values during the search, use [cross validation](https://scikit-learn.org/stable/modules/cross_validation.html) on the training data or split the training data into train and validation sets.
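One possible way to organize such a search with cross validation is sketched below. It runs on synthetic data from `make_regression` rather than the diabetes data, and the names `X_demo`, `y_demo`, and `cv_results` are assumptions for illustration. The test set is never touched during the search.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data (not the diabetes dataset)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=4, noise=10.0, random_state=0
)

cv_results = []
for alpha in [0.1, 1, 5, 10]:
    model = Lasso(alpha=alpha)
    # 5-fold CV; the scorer returns negative RMSE, so flip the sign
    scores = cross_val_score(
        model, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
    )
    cv_rmse = -scores.mean()
    # Count near-zero coefficients after fitting on all of the demo data
    n_zero = int(np.sum(np.abs(model.fit(X_demo, y_demo).coef_) < 1e-4))
    cv_results.append((alpha, cv_rmse, n_zero))
    print(f"alpha={alpha}: CV RMSE={cv_rmse:.2f}, zero coefficients={n_zero}")
```

Compared with a single validation split, averaging over folds makes the error estimate less sensitive to which rows happen to be held out.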

For your final Lasso model with the chosen `alpha` fit on all of the training data, report the [root mean squared error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.root_mean_squared_error.html#sklearn.metrics.root_mean_squared_error) of the model predictions on the test data. Report the three or more features for which the model coefficients were set to 0 (see feature names/interpretations above). Also please explain why you selected this `alpha`.

In [11]:
from sklearn.linear_model import Lasso

# Write task 4 code here
# Hold out 30% of the training data as a validation set for choosing alpha
X_train_sub, X_validation, y_train_sub, y_validation = train_test_split(X_train, y_train, test_size = 0.3, random_state = 2025)
alphas = np.r_[1e-3, 1e-2, 1e-1, np.arange(1, 101)]

# Record (alpha, validation RMSE, number of near-zero coefficients) for each alpha
results = []
for curr_alpha in alphas:
    temp_lasso_model = Lasso(alpha = curr_alpha)
    temp_lasso_model.fit(X_train_sub, y_train_sub)
    y_temp_prediction = temp_lasso_model.predict(X_validation)
    RMSE_value = root_mean_squared_error(y_validation, y_temp_prediction)
    num_coef_zeros = np.sum(np.abs(temp_lasso_model.coef_) < 1e-4)
    results.append((curr_alpha, RMSE_value, num_coef_zeros))

# Keep the alphas that are sparse enough, ordered by validation RMSE (best first)
good_candidates = sorted([result for result in results if result[2] >= 3], key = lambda r: r[1])

# Refit each candidate on the full training set and keep the first one
# that still has at least three near-zero coefficients
final_alpha = None
final_lasso_model = None
final_zero_count = None
final_zero_features = None
for candidate_alpha, candidate_RMSE, candidate_coef_zeros in good_candidates:
    current_final_lasso_model = Lasso(alpha = candidate_alpha)
    current_final_lasso_model.fit(X_train, y_train)
    zero_mask = np.abs(current_final_lasso_model.coef_) < 1e-4
    zero_count = int(zero_mask.sum())
    if zero_count >= 3:
        final_alpha = candidate_alpha
        final_lasso_model = current_final_lasso_model
        final_zero_count = zero_count
        final_zero_features = [name for name, zero in zip(diabetes.feature_names, zero_mask) if zero]
        break

# Fail with a clear message if no alpha satisfied the sparsity requirement
if final_lasso_model is None:
    raise ValueError("No candidate alpha gave at least three zero coefficients")

# Evaluate on the test set only once, after alpha has been chosen
y_test_lasso_pred = final_lasso_model.predict(X_test)
RMSE_test_lasso_model = root_mean_squared_error(y_test, y_test_lasso_pred)
print("The chosen alpha for the final Lasso model is", final_alpha)
print("The test RMSE for the Lasso model is", RMSE_test_lasso_model)
print("Does the Lasso model meet the performance requirement?:", RMSE_test_lasso_model <= (RMSE_test * 1.10))
print("Does the Lasso model meet the sparsity requirement?:", final_zero_count >= 3)
print("Features that have zero coefficients are", final_zero_features)

The chosen alpha for the final Lasso model is 16.0
The test RMSE for the Lasso model is 56.36195251114452
Does the Lasso model meet the performance requirement?: True
Does the Lasso model meet the sparsity requirement?: True
Features that have zero coefficients are ['age', 'sex', 's4', 's5']


I ultimately selected an alpha of 16.0 because the resulting Lasso model satisfies both the performance and the sparsity requirements. To select this value, I first split the training data into a smaller training set and a validation set. For each candidate alpha, I fit a model on the smaller training set, computed the validation RMSE, counted how many coefficients were effectively zero, and appended the tuple to the results list. I then kept only the alphas with at least three zero coefficients, which makes them good candidates, and sorted them by validation RMSE. Next, I looped over the sorted candidates, refit each one on the full training set, and stopped at the first model that still had at least three zero coefficients. Only after leaving that loop did I evaluate the model on the test set.

Initially, I used a single loop and relied only on the validation split to decide which alpha met the criteria. The first alpha I selected was 3.0, since it had a good validation RMSE and three features with a coefficient of zero. But when I refit the model with an alpha of 3.0 on the entire training set and inspected the coefficients, only two of them were below the 0.0001 threshold. With Lasso, a small change in the training data, such as going from the smaller training set to the full training set, can alter the coefficients enough that the final model no longer satisfies the sparsity requirement. This is why I added a second loop over the sorted candidates to check that the model refit on the full training data still meets the sparsity requirement.

In the end, I chose an alpha of 16.0 because it had the lowest validation RMSE among the candidates that still satisfied the sparsity requirement, with at least three zero coefficients, after refitting on the full training set. This model also satisfies the performance requirement: its test RMSE is at most 10% greater than that of the linear model with all the features in Task 3.
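The performance check can be verified by hand from the numbers printed in the earlier cells. The sketch below copies those RMSE values as constants; the variable names are illustrative.

```python
# RMSE values copied from the printed outputs of Tasks 2, 3, and 4
rmse_baseline = 75.8165287907097
rmse_ols_test = 55.14187488833094
rmse_lasso_test = 56.36195251114452

# The Lasso model may be at most 10% worse than the OLS model
performance_limit = 1.10 * rmse_ols_test
relative_increase = rmse_lasso_test / rmse_ols_test - 1

print(f"Allowed Lasso test RMSE: at most {performance_limit:.2f}")
print(f"Lasso test RMSE is {relative_increase:.1%} above the OLS test RMSE")
print("Meets performance requirement:", rmse_lasso_test <= performance_limit)
print("Beats the mean baseline:", rmse_lasso_test < rmse_baseline)
```

The Lasso test RMSE is roughly 2% above the ordinary least squares test RMSE, well inside the 10% allowance, and it remains far below the mean baseline from Task 2.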