# Linear Regression Example - Diabetes dataset

This lesson uses the [diabetes dataset](course_datasets.md#diabetes). It is based on the linear regression diabetes tutorial in [Section 2.1 of Microsoft's ML for Beginners course](https://github.com/microsoft/ML-For-Beginners/tree/main/2-Regression/1-Tools).


In [72]:
import matplotlib.pyplot as plt # for charts
import numpy as np
# sklearn is the scikit-learn package
from sklearn import datasets, linear_model, model_selection

The diabetes dataset is a sample dataset bundled with the scikit-learn package.

`X` is the array of independent variables (features). `y` is the vector of target values, a quantitative measure of disease progression.

In [None]:
X, y = datasets.load_diabetes(return_X_y=True)
print('shape of X:', X.shape)
print('first row of X:', X[0])


Get all rows, but only the third column (index 2, the BMI value), from the independent variables.

In [None]:
X_BMI = X[:, 2]
print('shape of X_BMI:', X_BMI.shape)
print('first row of X_BMI:', X_BMI[0])
print('number of dimensions of X_BMI:', X_BMI.ndim)
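
One way to check that index 2 really is the BMI column is to load the dataset as a `Bunch` object, which carries a `feature_names` list alongside the data. A minimal sketch (the variable names `diabetes` and `bmi_index` are illustrative):

```python
from sklearn import datasets

# Without return_X_y=True, load_diabetes() returns a Bunch with metadata
diabetes = datasets.load_diabetes()
print(diabetes.feature_names)

# Look up the column index by name instead of hard-coding it
bmi_index = diabetes.feature_names.index('bmi')
X_BMI = diabetes.data[:, bmi_index]
print('BMI column index:', bmi_index)
print('shape of X_BMI:', X_BMI.shape)
```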



In [None]:
# Create a 2D array from the 1D array by reshaping it to (-1, 1).
# scikit-learn estimators expect X as a 2D array of shape (n_samples, n_features).
X_BMI_reshaped = X_BMI.reshape((-1, 1))
print('shape of X_BMI_reshaped:', X_BMI_reshaped.shape)
print('first row of X_BMI_reshaped:', X_BMI_reshaped[0])
print('number of dimensions of X_BMI_reshaped:', X_BMI_reshaped.ndim)
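
The effect of `reshape((-1, 1))` is easier to see on a tiny array. This sketch uses a made-up three-row array (not the diabetes data) and also shows two equivalent ways to select a column as a 2D array directly:

```python
import numpy as np

# Small made-up array: 3 samples, 4 features
demo = np.arange(12).reshape(3, 4)

col_1d = demo[:, 2]               # shape (3,): a flat vector
col_2d = col_1d.reshape((-1, 1))  # shape (3, 1): -1 lets NumPy infer the row count

# Indexing with a list or with np.newaxis keeps the column dimension
col_list = demo[:, [2]]
col_newaxis = demo[:, np.newaxis, 2]

print(col_1d.shape, col_2d.shape, col_list.shape, col_newaxis.shape)
```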


In [None]:
# Split into train and test datasets (33% of the rows are held out for testing).
# Univariate alternative, using the BMI column only:
#X_train, X_test, y_train, y_test = model_selection.train_test_split(X_BMI_reshaped, y, test_size=0.33)
# Multivariate version, using all the features:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.33)
print('shape of X_train:', X_train.shape, ' y_train:', y_train.shape)
print('first row of X_train:', X_train[0], ' y_train:', y_train[0])
print('number of dimensions of X_train:', X_train.ndim, ' y_train:', y_train.ndim)
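
`train_test_split` shuffles the rows randomly, so the split (and every number computed from it) changes on each run. If reproducible results are wanted, passing a fixed `random_state` is one option. The sketch below uses a small made-up array and an arbitrary seed of 42:

```python
import numpy as np
from sklearn import model_selection

# Made-up data: 10 samples, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# The same random_state gives the same split every time
split_a = model_selection.train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)
split_b = model_selection.train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
print('train rows:', len(X_train_a), ' test rows:', len(X_test_a))
print('identical splits:', all(np.array_equal(a, b) for a, b in zip(split_a, split_b)))
```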


Create a ridge regression model (linear regression with L2 regularization) and train it on the training data set.

In [None]:
#model = linear_model.LinearRegression()
model = linear_model.Ridge()
model.fit(X_train, y_train)
model.coef_, model.intercept_, model
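
`Ridge` fits the same kind of linear model as `LinearRegression` but adds an L2 penalty on the coefficients, controlled by `alpha` (1.0 by default). As a rough illustration of the difference, the sketch below fits both on the same split and compares the size of the coefficient vectors; the seed and variable names are arbitrary choices:

```python
import numpy as np
from sklearn import datasets, linear_model, model_selection

X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.33, random_state=0)

ols = linear_model.LinearRegression().fit(X_train, y_train)
ridge = linear_model.Ridge(alpha=1.0).fit(X_train, y_train)

# The L2 penalty shrinks the coefficients towards zero
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print('OLS coefficient norm:  ', ols_norm)
print('Ridge coefficient norm:', ridge_norm)
```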

Generate predictions for the test data set with the model's `predict()` method.

In [None]:
y_pred = model.predict(X_test)
print('shape of y_pred:', y_pred.shape)

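Create a scatterplot of the test data, with BMI on the x-axis and the actual target values in black. The predictions are plotted as green points rather than as a line, because the model was trained on all ten features and its predictions do not fall on a single line when plotted against BMI alone.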

In [None]:
plt.scatter(X_test[:,2], y_test,  color='black')
plt.scatter(X_test[:, 2], y_pred,  color='green')
#plt.plot(X_test, y_pred, color='blue', linewidth=3)
plt.xlabel('Scaled BMIs')
plt.ylabel('Disease Progression')
plt.title('A Graph Plot Showing Diabetes Progression Against BMI')
plt.show()
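
A fitted line can only be drawn when the model is trained on a single feature. As a sketch of that univariate case (BMI only, with `LinearRegression` and an arbitrary seed), the predictions are exactly `coef_ * x + intercept_`, so two endpoints are enough to define the line:

```python
import numpy as np
from sklearn import datasets, linear_model, model_selection

X, y = datasets.load_diabetes(return_X_y=True)
X_BMI = X[:, [2]]  # keep BMI as a 2D column

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X_BMI, y, test_size=0.33, random_state=0)

line_model = linear_model.LinearRegression().fit(X_train, y_train)
y_pred = line_model.predict(X_test)

# Endpoints of the fitted line over the range of test BMI values
x_line = np.array([[X_test.min()], [X_test.max()]])
y_line = line_model.predict(x_line)

# plt.plot(x_line, y_line, color='blue', linewidth=3) would draw this line
# on top of plt.scatter(X_test, y_test, color='black')
print('slope:', line_model.coef_[0], ' intercept:', line_model.intercept_)
```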

Possible Exercise(s): 
1. Use another column of the X data and do a linear regression
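1. Score the model, i.e. measure how well it fits. For regression, `score()` returns the coefficient of determination (R²), not a classification accuracy. (Code examples are in other lessons.)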
1. Do a multivariate analysis rather than a univariate analysis on this dataset
1. Use the [linnerud dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_linnerud.html#sklearn.datasets.load_linnerud)



In [None]:
# For regression models, score() returns the coefficient of determination (R^2), not accuracy
r2 = model.score(X_test, y_test)
r2
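
For regressors, `score()` returns the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below recomputes it by hand on a fresh split (arbitrary seed) to show what the number means:

```python
import numpy as np
from sklearn import datasets, linear_model, model_selection

X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.33, random_state=0)
model = linear_model.Ridge().fit(X_train, y_train)

y_pred = model.predict(X_test)
ss_res = np.sum((y_test - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print('R^2 by hand:      ', r2_manual)
print('R^2 from score(): ', model.score(X_test, y_test))
```

An R² of 1 means perfect predictions, 0 means the model does no better than always predicting the mean of `y_test`, and negative values are possible for a model that does worse than that.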