> This is a self-correcting activity generated by [nbgrader](https://nbgrader.readthedocs.io). Fill in any place that says `YOUR CODE HERE` or `YOUR ANSWER HERE`. Run subsequent cells to check your code.

---

# Predict diabetes evolution

In this activity, you'll train several regression models to predict disease progression one year after baseline.

The [Diabetes](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html) dataset contains ten baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) obtained for each of n = 442 diabetes patients, as well as the response of interest: a quantitative measure of disease progression one year after baseline.

## Environment setup

In [1]:
# Import base packages
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

In [2]:
# Import ML packages
import sklearn

print(f"scikit-learn version: {sklearn.__version__}")

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error

scikit-learn version: 0.23.2


## Step 1: Loading the data

In [3]:
dataset = load_diabetes()

# Put data in a pandas DataFrame
df_diab = pd.DataFrame(dataset.data, columns=dataset.feature_names)
# Add target to DataFrame
df_diab["target"] = dataset.target
# Show 10 random samples
df_diab.sample(n=10)

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 377 | 0.019913 | 0.05068 | 0.009961 | 0.018429 | 0.014942 | 0.044719 | -0.061809 | 0.07121 | 0.009436 | -0.063209 | 235.0 |
| 419 | -0.020045 | -0.044642 | -0.054707 | -0.053871 | -0.066239 | -0.057367 | 0.011824 | -0.039493 | -0.074089 | -0.00522 | 42.0 |
| 85 | 0.045341 | -0.044642 | 0.071397 | 0.001215 | -0.009825 | -0.001001 | 0.015505 | -0.039493 | -0.04118 | -0.071494 | 141.0 |
| 107 | 0.027178 | -0.044642 | 0.04984 | -0.055018 | -0.002945 | 0.040648 | -0.058127 | 0.052759 | -0.052959 | -0.00522 | 144.0 |
| 328 | -0.038207 | -0.044642 | 0.067085 | -0.060757 | -0.029088 | -0.023234 | -0.010266 | -0.002592 | -0.001499 | 0.019633 | 78.0 |
| 179 | -0.023677 | -0.044642 | -0.015906 | -0.012556 | 0.020446 | 0.041274 | -0.043401 | 0.034309 | 0.014072 | -0.009362 | 151.0 |
| 277 | -0.034575 | -0.044642 | -0.059019 | 0.001215 | -0.053855 | -0.078035 | 0.067048 | -0.076395 | -0.021394 | 0.015491 | 64.0 |
| 21 | -0.08543 | 0.05068 | -0.022373 | 0.001215 | -0.037344 | -0.026366 | 0.015505 | -0.039493 | -0.072128 | -0.017646 | 49.0 |
| 406 | -0.05637 | -0.044642 | -0.080575 | -0.084857 | -0.037344 | -0.037013 | 0.033914 | -0.039493 | -0.056158 | -0.137767 | 72.0 |
| 189 | -0.001882 | -0.044642 | -0.066563 | 0.001215 | -0.002945 | 0.00307 | 0.011824 | -0.002592 | -0.020289 | -0.02593 | 79.0 |


## Step 2: Preparing the data

### Question

Split the dataset into training (variables `x_train`, `y_train`) and test sets (variables `x_test`, `y_test`), holding out 20% of the samples for the test set.

In [4]:
# YOUR CODE HERE
x = df_diab.drop(columns="target").to_numpy()
y = df_diab["target"].to_numpy()
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2)

In [5]:
print(f"x_train: {x_train.shape}. y_train: {y_train.shape}")
print(f"x_test: {x_test.shape}. y_test: {y_test.shape}")

assert x_train.shape == (353, 10)
assert y_train.shape == (353,)
assert x_test.shape == (89, 10)
assert y_test.shape == (89,)

x_train: (353, 10). y_train: (353,)
x_test: (89, 10). y_test: (89,)
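
The shapes checked above follow from how the 20% hold-out is rounded. As a quick sanity check (a sketch of the arithmetic only, assuming the test-set size is rounded up, which matches the asserted shapes), 20% of 442 samples is 88.4, which gives 89 test samples and 353 training samples:

```python
import math

n_samples = 442      # number of patients in the dataset
test_ratio = 0.2     # fraction held out for testing

# Test-set size is rounded up; the training set gets the remainder
n_test = math.ceil(test_ratio * n_samples)
n_train = n_samples - n_test

print(f"train: {n_train}, test: {n_test}")
```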


## Step 3: Training several models

In [6]:
def eval_model(model):
    y_train_pred = model.predict(x_train)
    y_test_pred = model.predict(x_test)

    # Train and test MSE
    train_mse = mean_squared_error(y_train, y_train_pred)
    test_mse = mean_squared_error(y_test, y_test_pred)

    print(f"Training MSE: {train_mse:.2f}. Test MSE: {test_mse:.2f}")
    
    return train_mse, test_mse
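
For reference, the MSE reported by `eval_model` is the average of the squared differences between true and predicted targets. The sketch below computes it by hand on a few made-up values; the helper name `mse` and the numbers are hypothetical and only illustrate the formula.

```python
def mse(y_true, y_pred):
    """Mean of squared residuals (hypothetical helper for illustration)."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Made-up values: residuals are 1, 0 and -2
y_true_demo = [3.0, 5.0, 2.0]
y_pred_demo = [2.0, 5.0, 4.0]

demo_mse = mse(y_true_demo, y_pred_demo)  # (1 + 0 + 4) / 3
print(f"MSE: {demo_mse:.4f}")
```

Since errors are squared, a few large misses weigh much more heavily on the score than many small ones.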

### Question

Create and train a Decision Tree, a MultiLayer Perceptron and a Random Forest on the training data.

Compute their MSE on the training and test data.

In [7]:
# Import the needed scikit-learn packages
# YOUR CODE HERE
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor

In [8]:
# Create and train a Decision Tree
# YOUR CODE HERE
dt_model = DecisionTreeRegressor()
dt_model.fit(x_train, y_train)
eval_model(dt_model)

Training MSE: 0.00. Test MSE: 4927.69


(0.0, 4927.685393258427)
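
A training MSE of 0.00 next to a much larger test MSE suggests that the unconstrained tree has memorized the training samples. The sketch below illustrates this effect on synthetic data (not the diabetes dataset): the data-generating function, the seed and `max_depth=3` are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: a linear signal plus noise (illustrative only)
rng = np.random.default_rng(0)
x_demo = rng.uniform(-1, 1, size=(200, 3))
y_demo = 10 * x_demo[:, 0] + rng.normal(scale=5.0, size=200)

# Unconstrained tree: grows until every training sample is fitted exactly
deep_tree = DecisionTreeRegressor(random_state=0).fit(x_demo, y_demo)
# Depth-limited tree: cannot isolate each sample, so it keeps some training error
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(x_demo, y_demo)

deep_train_mse = mean_squared_error(y_demo, deep_tree.predict(x_demo))
shallow_train_mse = mean_squared_error(y_demo, shallow_tree.predict(x_demo))

print(f"Unconstrained tree, training MSE: {deep_train_mse:.2f}")
print(f"max_depth=3 tree, training MSE: {shallow_train_mse:.2f}")
```

Limiting depth is one possible way to trade training accuracy for better generalization; whether it helps on the diabetes data would need to be checked on the test set.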

In [9]:
# Create and train a MLP
# YOUR CODE HERE
mlp_model = MLPRegressor(hidden_layer_sizes=(100,100,100,))
mlp_model.fit(x_train, y_train)
eval_model(mlp_model)

Training MSE: 3064.92. Test MSE: 2943.93


(3064.91726650784, 2943.928327723847)

In [10]:
# Create and train a Random Forest
# YOUR CODE HERE
rf_model = RandomForestRegressor()
rf_model.fit(x_train, y_train)
eval_model(rf_model)

Training MSE: 486.33. Test MSE: 3161.15


(486.3278804532578, 3161.1519022471916)

## Step 4: Tuning the most promising model

### Question

Choose the most promising model and tune it, using a `GridSearchCV` instance stored in the `grid_search_cv` variable.

Your test MSE should be less than 3500.

In [14]:
# YOUR CODE HERE
param_grid = [
    {"n_estimators": [10, 50, 100, 150], "max_features": [1, 2, 4, 6, 8, 10]},
]
grid_search_cv = GridSearchCV(rf_model, param_grid=param_grid, verbose=1, cv=5, scoring="neg_mean_squared_error")

In [15]:
# Search for the best hyperparameters for the chosen regressor on the training data
grid_search_cv.fit(x_train, y_train)

# Print the best combination of hyperparameters found
print(grid_search_cv.best_params_)

Fitting 5 folds for each of 30 candidates, totalling 150 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
{'max_features': 2, 'n_estimators': 100}
[Parallel(n_jobs=1)]: Done 150 out of 150 | elapsed:   27.7s finished
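
The cost of a grid search is the number of hyperparameter combinations multiplied by the number of cross-validation folds. The sketch below counts them for the grid defined in the cell above, using `itertools.product` as a stand-in for the enumeration that `GridSearchCV` performs. With these values it gives 24 candidates and 120 fits; the captured log reports 30 candidates, so it may have been produced with a slightly different grid.

```python
from itertools import product

# Same values as the grid defined for the Random Forest
param_grid = {
    "n_estimators": [10, 50, 100, 150],
    "max_features": [1, 2, 4, 6, 8, 10],
}
n_folds = 5  # matches cv=5

# Every combination of one value per hyperparameter
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
n_fits = len(candidates) * n_folds

print(f"{len(candidates)} candidates, {n_fits} fits")
```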


In [16]:
# Evaluate best estimator
train_mse, test_mse = eval_model(grid_search_cv.best_estimator_)

assert train_mse < 1000
assert test_mse < 3500

Training MSE: 456.19. Test MSE: 3286.14
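
Because MSE is expressed in squared target units, its square root (RMSE) is often easier to interpret: it is on the same scale as the target column. A small sketch using the test MSE printed above:

```python
import math

test_mse = 3286.14  # test MSE reported for the tuned model

# RMSE is on the same scale as the target values
test_rmse = math.sqrt(test_mse)
print(f"Test RMSE: {test_rmse:.2f}")
```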
