# Linear regression on the diabetes dataset

Let's explore the datasets that ship with scikit-learn. These datasets have already been cleaned and formatted for use in ML algorithms.

First, we will load the diabetes dataset. Do this in the cell below by importing `load_diabetes` from `sklearn.datasets` and then loading the dataset into the `diabetes` variable with the `load_diabetes()` function ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html)).

In [7]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
import seaborn as sns

In [8]:
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()

Let's explore this variable by looking at the different attributes (keys) of `diabetes`. Note that the `load_diabetes` function does not return a dataframe by default. It returns a `Bunch`, a dictionary-like object whose keys can also be read as attributes.

In [9]:
print(diabetes.keys())

dict_keys(['data', 'target', 'frame', 'DESCR', 'feature_names', 'data_filename', 'target_filename', 'data_module'])
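
As a small illustration of how the returned object behaves, the sketch below reads the same array by key and by attribute. It also shows the optional `as_frame=True` argument of `load_diabetes`, which this notebook does not otherwise use. The variable names `diabetes_bunch`, `diabetes_frames`, and `combined` are chosen only for this example.

```python
from sklearn.datasets import load_diabetes

diabetes_bunch = load_diabetes()

# Key access and attribute access return the same underlying object.
same_data = diabetes_bunch["data"] is diabetes_bunch.data
feature_names = diabetes_bunch.feature_names

# With the default arguments, the 'frame' entry is empty (None).
frame_is_empty = diabetes_bunch.frame is None

# as_frame=True asks scikit-learn to build pandas objects instead.
diabetes_frames = load_diabetes(as_frame=True)
combined = diabetes_frames.frame  # 10 feature columns plus 'target'

print(same_data, frame_is_empty, combined.shape)
```

This explains why the `frame` key appears in the output above even though no dataframe was requested.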


#### The next step is to read the description of the dataset. 

Print the description in the cell below using the `DESCR` attribute of the `diabetes` variable. Read the data description carefully to fully understand what each column represents.

*Hint: If your output is ill-formatted by displaying linebreaks as `\n`, it means you are not using the `print` function.*
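
The difference mentioned in the hint can be seen with any string that contains a line break. The following is a minimal sketch with a made-up two-line string (`description` is a placeholder, not the real `DESCR` text):

```python
description = "Diabetes dataset\n----------------"

# A bare expression in a notebook cell displays repr(), which keeps the
# escape sequence as the two characters backslash and n.
as_repr = repr(description)

# print() writes the text itself, so the line break is rendered.
print(as_repr)
print(description)
```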

In [10]:
print(diabetes.DESCR)

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, total serum cholesterol
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, total cholesterol / HDL
      - s5      ltg, possibly log of serum triglycerides level
      - s6      glu, blood sugar level

Note: Each of these 1

#### Based on the data description, answer the following questions:

1. How many attributes are there in the data? What do they mean?

1. What is the relation between `diabetes['data']` and `diabetes['target']`?

1. How many records are there in the data?

#### Answers:

1. How many attributes are there in the data? What do they mean? There are 10 attributes: age, sex, body mass index, average blood pressure, and six blood serum measurements (`s1` to `s6`). All 10 columns are numeric predictive values, so they are the independent (predictor) variables.

1. What is the relation between `diabetes['data']` and `diabetes['target']`? Both describe the same 442 patients, row for row. `diabetes['data']` holds the 10 predictors, and `diabetes['target']` holds the dependent (response) variable, listed as column 11 in the description: a quantitative measure of disease progression one year after baseline. The target is what we will try to predict from the data.

1. How many records are there in the data? There are 442 records, one for each of the 442 diabetes patients.

#### Now explore what is contained in the *data* portion as well as the *target* portion of `diabetes`.

Scikit-learn typically takes in 2D numpy arrays as input (though pandas dataframes are also accepted). Inspect the shape of `data` and `target`. Confirm they are consistent with the data description.
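
One possible way to run this check directly on the arrays is sketched below. The last three lines go slightly beyond the task: they rely on the fact that `load_diabetes` returns mean-centered, rescaled features by default, which is why the values in the table are small numbers around zero.

```python
import numpy as np
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
data, target = diabetes["data"], diabetes["target"]

print(data.shape)    # 2D: one row per patient, one column per attribute
print(target.shape)  # 1D: one value per patient

# Both arrays must describe the same number of records.
n_records, n_attributes = data.shape
assert target.shape == (n_records,)

# Default scaling: each column has mean 0 and a sum of squares of 1.
column_means = data.mean(axis=0)
column_sq_sums = (data ** 2).sum(axis=0)
print(np.allclose(column_means, 0.0), np.allclose(column_sq_sums, 1.0))
```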

In [18]:
df_diabetes = pd.DataFrame(data = diabetes['data'],columns = diabetes['feature_names'])

In [19]:
df_diabetes

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.050680,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019908,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068330,-0.092204
2,0.085299,0.050680,0.044451,-0.005671,-0.045599,-0.034194,-0.032356,-0.002592,0.002864,-0.025930
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022692,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031991,-0.046641
...,...,...,...,...,...,...,...,...,...,...
437,0.041708,0.050680,0.019662,0.059744,-0.005697,-0.002566,-0.028674,-0.002592,0.031193,0.007207
438,-0.005515,0.050680,-0.015906,-0.067642,0.049341,0.079165,-0.028674,0.034309,-0.018118,0.044485
439,0.041708,0.050680,-0.015906,0.017282,-0.037344,-0.013840,-0.024993,-0.011080,-0.046879,0.015491
440,-0.045472,-0.044642,0.039062,0.001215,0.016318,0.015283,-0.028674,0.026560,0.044528,-0.025930


In [21]:
df_diabetes['target'] = diabetes['target']

In [25]:
df_diabetes

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,0.038076,0.050680,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019908,-0.017646,151.0
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068330,-0.092204,75.0
2,0.085299,0.050680,0.044451,-0.005671,-0.045599,-0.034194,-0.032356,-0.002592,0.002864,-0.025930,141.0
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022692,-0.009362,206.0
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031991,-0.046641,135.0
...,...,...,...,...,...,...,...,...,...,...,...
437,0.041708,0.050680,0.019662,0.059744,-0.005697,-0.002566,-0.028674,-0.002592,0.031193,0.007207,178.0
438,-0.005515,0.050680,-0.015906,-0.067642,0.049341,0.079165,-0.028674,0.034309,-0.018118,0.044485,104.0
439,0.041708,0.050680,-0.015906,0.017282,-0.037344,-0.013840,-0.024993,-0.011080,-0.046879,0.015491,132.0
440,-0.045472,-0.044642,0.039062,0.001215,0.016318,0.015283,-0.028674,0.026560,0.044528,-0.025930,220.0


In [23]:
df_diabetes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     442 non-null    float64
 1   sex     442 non-null    float64
 2   bmi     442 non-null    float64
 3   bp      442 non-null    float64
 4   s1      442 non-null    float64
 5   s2      442 non-null    float64
 6   s3      442 non-null    float64
 7   s4      442 non-null    float64
 8   s5      442 non-null    float64
 9   s6      442 non-null    float64
 10  target  442 non-null    float64
dtypes: float64(11)
memory usage: 38.1 KB


In [27]:
df_diabetes.columns

Index(['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6',
       'target'],
      dtype='object')

In [28]:
X = df_diabetes[['age', 'sex', 'bmi', 'bp', 's1', 's2',
                's3', 's4', 's5', 's6']]

In [29]:
y = df_diabetes['target']

In [31]:
X

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.050680,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019908,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068330,-0.092204
2,0.085299,0.050680,0.044451,-0.005671,-0.045599,-0.034194,-0.032356,-0.002592,0.002864,-0.025930
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022692,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031991,-0.046641
...,...,...,...,...,...,...,...,...,...,...
437,0.041708,0.050680,0.019662,0.059744,-0.005697,-0.002566,-0.028674,-0.002592,0.031193,0.007207
438,-0.005515,0.050680,-0.015906,-0.067642,0.049341,0.079165,-0.028674,0.034309,-0.018118,0.044485
439,0.041708,0.050680,-0.015906,0.017282,-0.037344,-0.013840,-0.024993,-0.011080,-0.046879,0.015491
440,-0.045472,-0.044642,0.039062,0.001215,0.016318,0.015283,-0.028674,0.026560,0.044528,-0.025930


In [32]:
y

0      151.0
1       75.0
2      141.0
3      206.0
4      135.0
       ...  
437    178.0
438    104.0
439    132.0
440    220.0
441     57.0
Name: target, Length: 442, dtype: float64

## Building a regression model

The data have already been split into predictor (*data*) and response (*target*) variables. Given this, we'll apply what we previously learned about linear regression to the diabetes dataset.

#### In the cell below, import the `LinearRegression` class from `sklearn.linear_model`.

In [62]:
from sklearn.linear_model import LinearRegression

#### Create a new instance of the linear regression model and assign the new instance to the variable `diabetes_model`.

In [63]:
diabetes_model = LinearRegression()
print(diabetes_model)
print(type(diabetes_model))

LinearRegression()
<class 'sklearn.linear_model._base.LinearRegression'>


#### Next, let's split the training and test data.

Define `diabetes_data_train`, `diabetes_target_train`, `diabetes_data_test`, and `diabetes_target_test`. Use the last 20 records for the test data and the rest for the training data.

In [60]:
# Positional split: the last 20 records for testing, the remaining 422 for training
X_diabetes_data_train, X_diabetes_data_test = X.iloc[:-20], X.iloc[-20:]
y_diabetes_target_train, y_diabetes_target_test = y.iloc[:-20], y.iloc[-20:]

In [61]:
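from sklearn.model_selection import train_test_split

# Note: this reassigns the four variables using a random (shuffled) 422/20 split,
# so the test set is no longer the last 20 records.
X_diabetes_data_train, X_diabetes_data_test, y_diabetes_target_train, y_diabetes_target_test = train_test_split(
    X, y, train_size=0.955, random_state=0
)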

print(f"X_train: {X_diabetes_data_train.shape}\ty_train: {y_diabetes_target_train.shape}")
print(f"X_test: {X_diabetes_data_test.shape}\ty_test: {y_diabetes_target_test.shape}")

X_train: (422, 10)	y_train: (422,)
X_test: (20, 10)	y_test: (20,)
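
The instructions ask for the last 20 records as test data, whereas a shuffled `train_test_split` picks 20 records at random. As a sketch of how the same function can produce the positional split, `shuffle=False` keeps the original row order. The `_pos` variable names are hypothetical and used only in this example.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X_all, y_all = load_diabetes(return_X_y=True, as_frame=True)

# shuffle=False keeps row order, so test_size=20 selects the last 20 rows.
X_train_pos, X_test_pos, y_train_pos, y_test_pos = train_test_split(
    X_all, y_all, test_size=20, shuffle=False
)

# The test set has the same row labels as a plain positional slice.
print(X_test_pos.index.equals(X_all.iloc[-20:].index))
print(X_train_pos.shape, X_test_pos.shape)
```

Which split to use is a design choice: the positional one follows the exercise text, while the shuffled one avoids any dependence on the order in which the records were stored.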


Fit the training data and target to `diabetes_model`. Print the *intercept* and *coefficients* of the model.

In [64]:
diabetes_model.fit(X_diabetes_data_train, y_diabetes_target_train)

LinearRegression()

In [68]:
cols = X_diabetes_data_train.columns
params = diabetes_model.coef_
inter = diabetes_model.intercept_

In [66]:
print(diabetes_model.coef_)

[ -32.3074285  -257.44432972  513.31945939  338.46656647 -766.86983748
  455.85416891   92.55795582  184.75163454  734.92318647   82.7231425 ]


In [67]:
print(diabetes_model.intercept_)

152.39189054201842


In [70]:
df_params = pd.DataFrame({'Variable': cols, 'Coef': params, 'Intercept': inter})
df_params

Unnamed: 0,Variable,Coef,Intercept
0,age,-32.307428,152.391891
1,sex,-257.44433,152.391891
2,bmi,513.319459,152.391891
3,bp,338.466566,152.391891
4,s1,-766.869837,152.391891
5,s2,455.854169,152.391891
6,s3,92.557956,152.391891
7,s4,184.751635,152.391891
8,s5,734.923186,152.391891
9,s6,82.723143,152.391891


#### Inspecting the results

From the outputs you should have seen:

- The intercept is a float number.
- The coefficients are an array containing 10 float numbers.

This is the linear regression model fitted to your training dataset.
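
To make the role of these two outputs concrete: a fitted linear regression predicts each value as the intercept plus the sum of every coefficient times its feature value. The sketch below rebuilds the predictions by hand and compares them with `predict`. It uses its own illustrative split (all rows except the last 20), so its numbers need not match the outputs above.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X_all, y_all = load_diabetes(return_X_y=True)
X_fit, y_fit = X_all[:-20], y_all[:-20]  # illustrative split for this example
X_new = X_all[-20:]

model = LinearRegression().fit(X_fit, y_fit)

# Rebuild the predictions from the intercept and the coefficients.
manual = model.intercept_ + X_new @ model.coef_
builtin = model.predict(X_new)

print(model.coef_.shape)             # one coefficient per feature
print(np.allclose(manual, builtin))  # both routes give the same values
```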

#### Using your fitted linear regression model, predict the *y* of `diabetes_data_test`.

#### Print your `diabetes_target_test` and compare with the prediction. 

#### Is `diabetes_target_test` exactly the same as the model prediction?
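
One way to carry out the prediction and comparison steps is sketched below. It assumes the random split used earlier in this notebook (`train_size=0.955`, `random_state=0`) and shortens the variable names for readability. The mean squared error and R² are added here as common summary measures; they are not part of the original exercise.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.955, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Side-by-side view of observed and predicted values.
comparison = pd.DataFrame({"actual": y_test.to_numpy(), "predicted": y_pred})
comparison["error"] = comparison["predicted"] - comparison["actual"]
print(comparison.round(1))

# The predictions are estimates, so an exact match is not expected.
exact_match = np.array_equal(y_test.to_numpy(), y_pred)
print("exact match:", exact_match)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))
```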

#### Which are the most important features?
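
One hedged way to approach this question: the 10 features in this dataset are all on the same scale, so the absolute size of each coefficient gives a rough ranking. Treat this as a heuristic rather than a definitive measure, because correlated features (for example, some of the serum measurements) can trade off against each other. The sketch assumes the same random split as earlier in the notebook.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.955, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# Rank features by the absolute value of their coefficient.
coefs = pd.Series(model.coef_, index=X_train.columns)
ranking = coefs.abs().sort_values(ascending=False)
print(ranking)
```

In the coefficients printed earlier in this notebook, `s1`, `s5`, and `bmi` have the largest absolute values.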