# Assignment 1: Regression 

For this assignment, you have four tasks on the diabetes dataset. The dataset is split into training and test data with a fixed seed (random_state fixed) to ensure reproducibility.
- Task 1: Fit a linear regression model on training data and evaluate mean_squared_error on testing data
- Task 2: Feature selection using Lasso regression on training data and report the top 4 most informative features
- Task 3: Fit a kernel regression model with the top 4 features and report the mean_squared_error on testing data
- Task 4: Get better performance (lower mean_squared_error on testing data using a model trained on the training data) with any models/hyperparameters/options (mentioned in the lecture or beyond it). The smaller your test error, the more points you will receive (like a data science competition). Enjoy!


In [1]:
# Load necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LassoCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.datasets import load_diabetes

# Load Dataset: 
diabetes = load_diabetes()

print(diabetes.feature_names)
print(diabetes.DESCR)
X, y = load_diabetes(return_X_y=True)

# Split dataset:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, T-Cells (a type of white blood cells)
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, thyroid stimulating hormone
      - s5      ltg, lamotrigine
      - s

## Task 1: Linear Regression 
Fit a linear regression model on training data and evaluate mean_squared_error on testing data

In [2]:
# define and fit your regression model
model = LinearRegression(fit_intercept=True)
model.fit(X_train, y_train)

# prediction on testing data
y_pred = model.predict(X_test)

# evaluate the prediction on testing data
print("mean_squared_error is :")
print(mean_squared_error(y_test, y_pred))

mean_squared_error is :
3424.316688213733
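
For reference, mean_squared_error is the average of the squared differences between the true and the predicted targets. The following is a minimal sketch on made-up toy values; the helper name `mse` and the toy arrays are assumptions for illustration and not part of the assignment.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared residuals (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Toy values, not taken from the diabetes data
y_true_toy = [3.0, 5.0, 7.0]
y_pred_toy = [2.0, 5.0, 9.0]

# Squared residuals are 1, 0 and 4, so the result is 5/3
toy_mse = mse(y_true_toy, y_pred_toy)
print(toy_mse)
```

Taking the square root puts the error back in the units of the target: the test error of 3424.3 reported above corresponds to a root mean squared error of about 58.5.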


## Task 2: Feature Selection 
#### Part 1: Find the optimal alpha for Lasso using cross-validation (GridSearchCV)

In [3]:
# alpha range
tuned_parameters = [{'alpha': [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2]}]

# grid search cross validation on Lasso w.r.t. alpha
gs = GridSearchCV(Lasso(fit_intercept=True), tuned_parameters, scoring='neg_mean_squared_error', cv=5)
gs.fit(X_train, y_train)
print("Best parameters set found on training set:")
print(gs.best_params_)

Best parameters set found on training set:
{'alpha': 0.02}
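
As a possible cross-check, LassoCV (imported in the first cell but otherwise unused) can search the same alpha grid with its own built-in cross-validation. This is only a sketch: it reloads and re-splits the data so that it runs on its own, and it is not guaranteed to select exactly the same alpha as the grid search.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

# Same data and split as in the first cell
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Same candidate alphas as the grid search
alphas = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2]

# 5-fold cross-validation over the given alphas, on the training data only
lasso_cv = LassoCV(alphas=alphas, cv=5, fit_intercept=True)
lasso_cv.fit(X_train, y_train)

print("alpha selected by LassoCV:", lasso_cv.alpha_)
```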


#### Part 2: Using the optimal alpha, find the top 4 most informative/useful features

In [4]:
# fit a Lasso model with the optimal alpha found by the grid search (0.02)
model = Lasso(alpha=gs.best_params_['alpha'], fit_intercept=True)
model.fit(X_train, y_train)

# print the coefficients; the top 4 features are those with the largest absolute values
print("Lasso coef:", model.coef_)

Lasso coef: [ -19.71806138 -225.30786454  567.7680786   287.2091127  -223.21051814
   -0.         -179.89846191   76.44365845  573.53367285   37.61455519]
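
The top 4 features can be read off by ranking the coefficients by absolute value. The sketch below copies the coefficients printed above into an array so that it runs on its own; in the notebook one would presumably use model.coef_ and diabetes.feature_names directly.

```python
import numpy as np

feature_names = ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

# Coefficients copied from the Lasso output above (alpha = 0.02)
coef = np.array([-19.71806138, -225.30786454, 567.7680786, 287.2091127, -223.21051814,
                 -0.0, -179.89846191, 76.44365845, 573.53367285, 37.61455519])

# Sort column indices by decreasing |coefficient| and keep the first four
top4_idx = np.argsort(-np.abs(coef))[:4]
top4_names = [feature_names[i] for i in top4_idx]

print(top4_idx.tolist())  # [8, 2, 3, 1]
print(top4_names)         # ['s5', 'bmi', 'bp', 'sex']
```

Note that the fourth place is a close call: sex (about 225.3 in absolute value) is only slightly ahead of s1 (about 223.2), so a different alpha or split could plausibly swap them.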


## Task 3: Kernel Regression with Top 4 Features
Fit a kernel regression model with the top 4 features and report the mean_squared_error on testing data

In [5]:
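# select the top 4 features by absolute Lasso coefficient:
# s5 (column 8), bmi (2), bp (3), sex (1)
top4 = [8, 2, 3, 1]
X_train_top4 = X_train[:, top4]
X_test_top4 = X_test[:, top4]

# fit a local-linear kernel regression model with a Gaussian kernel;
# var_type='cccc' treats all four features as continuous
KernelReg = sm.nonparametric.KernelReg
model = KernelReg(y_train, X_train_top4, reg_type='ll', var_type='cccc', ckertype='gaussian')

# fit() returns the predicted means and the marginal effects at the given points
y_pred, mfx = model.fit(X_test_top4)
print("mean_squared_error is :")
print(mean_squared_error(y_test, y_pred))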

mean_squared_error is :
3353.2438025814304
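
To make the idea behind kernel regression concrete, here is a minimal NumPy sketch of the simpler local-constant (Nadaraya-Watson) estimator with a Gaussian kernel. It is not what the cell above runs: KernelReg is used there with reg_type='ll' (local linear) and selects its own bandwidths, whereas this sketch uses a single fixed bandwidth. The function name `nadaraya_watson`, the bandwidth value and the toy data are assumptions for illustration.

```python
import numpy as np

def nadaraya_watson(X_fit, y_fit, X_query, bandwidth=1.0):
    """Local-constant kernel regression with a Gaussian kernel (illustrative helper)."""
    X_fit = np.asarray(X_fit, dtype=float)
    y_fit = np.asarray(y_fit, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    # Squared Euclidean distance between every query point and every training point
    diff = X_query[:, None, :] - X_fit[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    # Gaussian kernel weights: nearby training points count more
    weights = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    # Each prediction is a weighted average of the training targets
    return (weights @ y_fit) / weights.sum(axis=1)

# Toy data: the query point is equidistant from both training points,
# so the prediction is the plain average of their targets (1.0)
X_toy = np.array([[0.0], [2.0]])
y_toy = np.array([0.0, 2.0])
pred_mid = nadaraya_watson(X_toy, y_toy, np.array([[1.0]]))
print(pred_mid)
```

The local-linear variant fits a weighted straight line around each query point instead of a weighted average, which typically reduces bias near the edges of the data.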


## Task 4: Even higher performance


In [6]:
# This is your playground, try to improve the performance!
from sklearn.neighbors import KNeighborsRegressor

def kernel(distance):
    # Gaussian-shaped weights: closer neighbors get larger weights
    kernel_width = 5
    weights = np.exp(-(distance**2) / kernel_width)
    return weights

# k-nearest-neighbors regression that averages the 15 nearest targets using the kernel weights
knn = KNeighborsRegressor(n_neighbors=15, weights=kernel, metric='chebyshev', n_jobs=-1)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
cost = mean_squared_error(y_test, y_pred)
print("mean_squared_error is :")
print(cost)

mean_squared_error is :
3201.4548976261594
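
If hyperparameters such as n_neighbors or the metric are chosen by comparing test errors, the reported test error tends to be optimistic. One way to avoid this, sketched below under the assumption that a small grid is enough, is to select them by cross-validation on the training data and evaluate on the test data once at the end. The grid values and the helper name `gaussian_weights` are illustrative choices, not part of the assignment, and the sketch is not guaranteed to beat the score above.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Same data and split as in the first cell
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

def gaussian_weights(distance, kernel_width=5):
    # Same weighting as the kernel function above
    return np.exp(-(distance ** 2) / kernel_width)

# Hypothetical search grid; extend or change as needed
param_grid = {
    'n_neighbors': [5, 10, 15, 20, 30],
    'metric': ['euclidean', 'manhattan', 'chebyshev'],
}

# 5-fold cross-validation on the training data only
search = GridSearchCV(KNeighborsRegressor(weights=gaussian_weights), param_grid,
                      scoring='neg_mean_squared_error', cv=5)
search.fit(X_train, y_train)
print("Best parameters found on training set:", search.best_params_)

# Single final evaluation on the held-out test data
test_mse = mean_squared_error(y_test, search.predict(X_test))
print("mean_squared_error is :", test_mse)
```

One observation on the weighting: the features returned by load_diabetes are scaled to small values, so neighbor distances are small and a kernel width of 5 yields weights very close to 1. The weighted average is then nearly the same as a uniform one, and a much smaller width would be needed for the kernel to have a visible effect.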
