# Diabetes dataset: linear regression

Let's explore one of the datasets bundled with scikit-learn. These datasets have already been cleaned and formatted for use in ML algorithms.

In [1]:
import numpy as np
import pandas as pd  
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

In [2]:
diabetes = datasets.load_diabetes()
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)  # X is the feature matrix, y is the target vector

In [3]:
print(diabetes.DESCR)

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - Age
      - Sex
      - Body mass index
      - Average blood pressure
      - S1
      - S2
      - S3
      - S4
      - S5
      - S6

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bra

#### Based on the data description, answer the following questions:

1. How many attributes are there in the data? What do they mean?

1. What is the relation between `diabetes['data']` and `diabetes['target']`?

1. How many records are there in the data?

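1. There are 10 attributes: age, sex, body mass index, average blood pressure, and six blood serum measurements (S1 to S6). They are the predictor variables used to fit a regression model that can predict the target for new patients.
2. `diabetes['data']` holds the predictor values for each patient, and `diabetes['target']` holds the response to be predicted from them: a quantitative measure of disease progression one year after baseline.
3. There are 442 records.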
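To make these answers concrete, the sketch below loads the dataset into a pandas DataFrame, using the `feature_names` attribute of the object returned by `load_diabetes()`. The names `bunch` and `df` are illustrative and do not appear in the original notebook.

```python
import pandas as pd
from sklearn import datasets

bunch = datasets.load_diabetes()

# One column per attribute, plus the target appended as an extra column
df = pd.DataFrame(bunch.data, columns=bunch.feature_names)
df["target"] = bunch.target

print(df.shape)   # rows = records, columns = 10 attributes + target
print(df.head())
```

Viewing the data this way shows each record as one row, with the ten predictors next to the value they are meant to predict.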

#### Now explore what is contained in the *data* portion as well as the *target* portion of `diabetes`.

Scikit-learn typically takes in 2D numpy arrays as input (though pandas dataframes are also accepted). Inspect the shape of `data` and `target`. Confirm they are consistent with the data description.

In [7]:
print("diabetes_X.shape:", diabetes_X.shape)
print("diabetes_y.shape:", diabetes_y.shape)

diabetes_X.shape: (442, 10)
diabetes_y.shape: (442,)
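The data description also states that each feature column is mean centered and scaled so that its sum of squares totals 1. A minimal sketch of how that could be checked, assuming the default (scaled) version of the dataset is loaded; the variable names are illustrative:

```python
import numpy as np
from sklearn import datasets

X, y = datasets.load_diabetes(return_X_y=True)

# Per-column mean and sum of squares across the 442 records
col_means = X.mean(axis=0)
col_sum_sq = (X ** 2).sum(axis=0)

print("column means close to 0:", np.allclose(col_means, 0.0, atol=1e-6))
print("column sums of squares close to 1:", np.allclose(col_sum_sq, 1.0, atol=1e-6))
```

This scaling is worth keeping in mind later: because every column has the same scale, the fitted coefficients are large in absolute value even though the raw feature values are small.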


In [9]:
# Train and test subsets: hold out the last 20 of the 442 samples for testing (422 train, 20 test)
diabetes_X_train = diabetes_X[:-20] 
diabetes_y_train = diabetes_y[:-20]

diabetes_X_test = diabetes_X[-20:]
diabetes_y_test = diabetes_y[-20:]
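The slicing above holds out the last 20 rows. If a shuffled split with a chosen proportion such as 80:20 is wanted instead, one option is scikit-learn's `train_test_split`. This is a sketch of an alternative, not the split used in the rest of this notebook; `random_state=0` is an arbitrary choice made here for reproducibility.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True)

# Shuffle the rows and reserve 20% of them for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print("train:", X_train.shape, "test:", X_test.shape)
```

A shuffled split avoids depending on the order in which records happen to be stored, at the cost of results that vary with the random seed.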

## Building the regression model

In [10]:
from sklearn import linear_model
from sklearn.linear_model import LinearRegression

In [11]:
lg_model = LinearRegression(n_jobs = -1)

In [14]:
lg_model.fit(diabetes_X_train, diabetes_y_train)  # fit the model on the training subset

print("Linear regression intercept:", lg_model.intercept_)
print("Linear regression coefficients:", lg_model.coef_)

Linear regression intercept: 152.76430691633442
Linear regression coefficients: [ 3.03499549e-01 -2.37639315e+02  5.10530605e+02  3.27736980e+02
 -8.14131709e+02  4.92814588e+02  1.02848452e+02  1.84606489e+02
  7.43519617e+02  7.60951722e+01]


#### Analysis of the results

From the outputs you should have seen:
- The intercept is a float number.
- The coefficients are an array containing 10 float numbers.
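These two outputs fully define the fitted model: a prediction is the intercept plus the dot product of a feature row with the coefficient vector. The sketch below refits the same model on the same split and checks that identity; `model` and `manual_pred` are illustrative names.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression

X, y = datasets.load_diabetes(return_X_y=True)
X_train, y_train = X[:-20], y[:-20]
X_test = X[-20:]

model = LinearRegression().fit(X_train, y_train)

# y_hat = intercept + X @ coef, computed by hand for every test row
manual_pred = model.intercept_ + X_test @ model.coef_

print(np.allclose(manual_pred, model.predict(X_test)))
```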


In [21]:
diabetes_y_prediction = lg_model.predict(diabetes_X_test)

In [19]:
# Comparison between the predicted targets and the real data
print(diabetes_y_test)
print(diabetes_y_prediction)

[233.  91. 111. 152. 120.  67. 310.  94. 183.  66. 173.  72.  49.  64.
  48. 178. 104. 132. 220.  57.]
[197.61846908 155.43979328 172.88665147 111.53537279 164.80054784
 131.06954875 259.12237761 100.47935157 117.0601052  124.30503555
 218.36632793  61.19831284 132.25046751 120.3332925   52.54458691
 194.03798088 102.57139702 123.56604987 211.0346317   52.60335674]


In [24]:
mse = mean_squared_error(diabetes_y_test, diabetes_y_prediction)
r2 = r2_score(diabetes_y_test, diabetes_y_prediction)
print("Metrics for this Linear Regression on the test set: mse =", round(mse, 2), "& r2 =", round(r2, 2))

Metrics for this Linear Regression on the test set: mse = 2004.57 & r2 = 0.59


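This means that *Linear Regression* is only moderately accurate on this dataset. An r2 of 0.59 means the model explains about 59% of the variance in the test targets, and an mse of 2004.57 corresponds to a typical error of about 45 units (its square root) on test targets that range from 48 to 310.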
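For reference, both metrics can be recomputed by hand from the residuals, which makes their meaning easier to see. The sketch below refits the model on the same split; the variable names and the use of the root mean squared error (RMSE) are additions made here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X, y = datasets.load_diabetes(return_X_y=True)
X_train, y_train = X[:-20], y[:-20]
X_test, y_test = X[-20:], y[-20:]

y_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

# r2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# RMSE is the square root of the MSE, expressed in the same units as the target
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("r2 (manual):", round(r2_manual, 2), "r2 (sklearn):", round(r2_score(y_test, y_pred), 2))
print("rmse:", round(rmse, 2))
```

Read this way, r2 compares the model against always predicting the mean of the test targets, and the RMSE gives an error size that can be compared directly with the target values.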