
# Simple Linear Regression Example 

This example uses the `diabetes` dataset from scikit-learn. Only one feature is used so that the regression can be shown in a two-dimensional plot. The plot shows the fitted straight line: linear regression chooses the line that minimizes the residual sum of squares between the observed responses in the dataset and the responses predicted by the linear approximation.
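To make "minimizing the residual sum of squares" concrete, here is a minimal sketch that fits a line with the closed-form least-squares formulas. The toy arrays `x` and `y` are made up for illustration and are not taken from the diabetes dataset.

```python
import numpy as np

# Toy data, invented for illustration (not the diabetes dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Closed-form least-squares estimates for y ~ intercept + slope * x
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Residual sum of squares (RSS) of the fitted line
residuals = y - (intercept + slope * x)
rss = np.sum(residuals ** 2)

print("slope:", slope)
print("intercept:", intercept)
print("RSS:", rss)
```

Any other slope or intercept gives a larger RSS on the same data, which is the sense in which this line is the "best" fit.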


In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

In [2]:
# Load the diabetes dataset
diabetes = datasets.load_diabetes()

In [3]:
type(diabetes) # not a pandas dataframe

sklearn.utils.Bunch

In [4]:
diabetes.keys()

dict_keys(['data', 'target', 'DESCR', 'feature_names'])

In [5]:
diabetes.feature_names

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

In [None]:
print(diabetes)

In [7]:
print(diabetes.DESCR)

Diabetes dataset

Notes
-----

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

Data Set Characteristics:

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attributes:
    :Age:
    :Sex:
    :Body mass index:
    :Average blood pressure:
    :S1:
    :S2:
    :S3:
    :S4:
    :S5:
    :S6:

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani

Create a Pandas dataframe with feature names as the column names. 

In [6]:
df = pd.DataFrame(diabetes.data, columns = diabetes.feature_names)
df.head()

|   | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|-----|-----|-----|----|----|----|----|----|----|----|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.06833 | -0.092204 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.02593 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 |


The dataframe does not contain the target yet; the target values are stored separately in `diabetes.target`.

In [None]:
diabetes.target

Add the target to the dataframe as a new column named `output`

In [None]:
df['output'] = diabetes.target
df.head()

Check whether there are any missing (NaN) values

In [None]:
df.isnull().sum()

In [None]:
corr_matrix = df.corr()
corr_matrix["output"].sort_values(ascending=False)
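By default, `DataFrame.corr()` computes the Pearson correlation coefficient. As a rough sketch of what that number means, the function below computes it by hand with NumPy. The arrays `feature_a`, `feature_b`, and `target` and the helper `pearson` are hypothetical, made up for illustration.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return np.sum(a_centered * b_centered) / np.sqrt(
        np.sum(a_centered ** 2) * np.sum(b_centered ** 2)
    )

# Invented toy columns (not the diabetes data)
feature_a = np.array([-0.05, -0.02, 0.00, 0.03, 0.06])   # moves with the target
feature_b = np.array([0.02, -0.04, 0.05, -0.01, 0.00])   # mostly unrelated
target = np.array([90.0, 120.0, 150.0, 170.0, 220.0])

print("feature_a vs target:", pearson(feature_a, target))
print("feature_b vs target:", pearson(feature_b, target))
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one, which is why sorting the `output` column of the correlation matrix is a quick way to rank features.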

Now select the single feature that is most strongly correlated with the output, `bmi`

In [None]:
df1 = df[['bmi','output']]  # or df1 = df.loc[:,['bmi','output']]
df1.head()

Split the data into a training set and a test set

In [None]:
# drop the output column
X = df1.drop('output', axis = 1)
# select only the output column and make a copy 
y = df1['output'].copy()

In [None]:
type(X)

In [None]:
type(y)

In [None]:
diabetes_X_train, diabetes_X_test, diabetes_y_train, diabetes_y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
print(diabetes_X_train.shape)
print(diabetes_X_test.shape)
print(diabetes_y_train.shape)
print(diabetes_y_test.shape)
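The idea behind a random split can be sketched with plain NumPy: shuffle the row indices, then cut them at the requested fraction. This is only an illustration of the concept; `simple_split` is a hypothetical helper, and its shuffling will not reproduce the exact rows that `train_test_split` picks for `random_state = 42`.

```python
import math
import numpy as np

def simple_split(n_samples, test_size=0.2, seed=42):
    """Return (train_indices, test_indices) from a shuffled range of row indices."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    # Round the test set size up, so 20% of 442 rows becomes 89 rows
    n_test = math.ceil(test_size * n_samples)
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = simple_split(442)
print(len(train_idx), len(test_idx))
```

With the 442 rows of this dataset and `test_size = 0.2`, the sketch yields 353 training rows and 89 test rows, and every row lands in exactly one of the two sets.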

Create linear regression model object

In [None]:
regr = linear_model.LinearRegression()

Train the model using the training set

In [None]:
regr.fit(diabetes_X_train, diabetes_y_train)

In [None]:
# The coefficients
print('Coefficients: \n', regr.coef_)

Make predictions using the testing set

In [None]:
diabetes_y_pred = regr.predict(diabetes_X_test)
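For a fitted `LinearRegression`, a prediction is the intercept plus the coefficient times the feature value. The sketch below checks this on synthetic data; `X_toy`, `y_toy`, and the numbers used to generate them are assumptions made up for the example, not values from the diabetes model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic single-feature data: a known line plus random noise (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.uniform(-0.1, 0.1, size=(50, 1))
y_toy = 150.0 + 900.0 * X_toy[:, 0] + rng.normal(0.0, 20.0, size=50)

model = LinearRegression().fit(X_toy, y_toy)

# predict() should agree with intercept_ + X @ coef_
manual_pred = model.intercept_ + X_toy @ model.coef_
print(np.allclose(manual_pred, model.predict(X_toy)))
```

Printing `regr.intercept_` alongside `regr.coef_` in this notebook would give both numbers needed to write out the fitted line.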

Plot the straight line produced by the model with the test data 

In [None]:
plt.scatter(diabetes_X_test, diabetes_y_test,  color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)
plt.show()

Check the mean squared error (MSE) of this model on the test set 

In [None]:
print("Mean squared error: %.2f"
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))
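The mean squared error is the average of the squared differences between observed and predicted values. A minimal sketch with made-up numbers (the arrays `y_true` and `y_hat` are not from this notebook):

```python
import numpy as np

# Invented observed and predicted values, for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# Mean of the squared errors
mse = np.mean((y_true - y_hat) ** 2)
# The square root puts the error back in the units of the target
rmse = np.sqrt(mse)

print("MSE: %.3f" % mse)
print("RMSE: %.3f" % rmse)
```

Because the errors are squared, the MSE is in squared target units; the RMSE is often easier to compare against the scale of `output`.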

R^2 (coefficient of determination) regression score.
A score of 1 is a perfect prediction: all data points fall on the regression line and the MSE is 0.

In [None]:
print('R^2 score: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))
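R^2 compares the model's squared error with that of a baseline that always predicts the mean of the observed values. The sketch below computes it by hand; the arrays and the helper `r_squared` are hypothetical and made up for illustration.

```python
import numpy as np

def r_squared(y_true, y_hat):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Invented observed and predicted values, for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

r2 = r_squared(y_true, y_hat)
# Always predicting the mean is the baseline and scores exactly 0
r2_baseline = r_squared(y_true, np.full_like(y_true, y_true.mean()))

print("R^2: %.3f" % r2)
print("Baseline R^2: %.3f" % r2_baseline)
```

A model that does worse than the mean baseline gets a negative score, so R^2 is not bounded below by 0.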

### Exercise: Try to apply multiple linear regression with 2 features