Predicting the progression of diabetes in patients.
===

**Juan David Velásquez Henao**  
jdvelasq@unal.edu.co   
Universidad Nacional de Colombia, Sede Medellín  
Facultad de Minas  
Medellín, Colombia

---

Click [here](https://github.com/jdvelasq/IPython-for-predictive-analytics/blob/master/06-(P)-regression-diabetes.ipynb) to access the latest version online.

Click [here](http://nbviewer.jupyter.org/github/jdvelasq/IPython-for-predictive-analytics/blob/master/06-(P)-regression-diabetes.ipynb) to view the latest version online in `nbviewer`.

---
[License](https://github.com/jdvelasq/IPython-for-predictive-analytics/blob/master/LICENSE)  
[Readme](https://github.com/jdvelasq/IPython-for-predictive-analytics/blob/master/readme.md)

# Real-world problem definition

# Problem definition in terms of the data

"Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline."

In [1]:
## Load the data
with open('data/diabetes.csv') as f:
    diabetes = f.readlines()

## Strip the trailing newline from each string
diabetes = [x.rstrip('\n') for x in diabetes]

## Turn each line into a list of strings
## by splitting the original line on commas.
diabetes = [x.split(',') for x in diabetes]


## Store the column names
colnames = diabetes[0]

## Drop the first row (column names)
diabetes = diabetes[1:]

## Convert all columns to float
diabetes = [[float(y) for y in x] for x in diabetes]

## Print the first two records
diabetes[0:2]

[[0.0380759064334241,
  0.0506801187398187,
  0.0616962065186885,
  0.0218723549949558,
  -0.0442234984244464,
  -0.0348207628376986,
  -0.0434008456520269,
  -0.00259226199818282,
  0.0199084208763183,
  -0.0176461251598052,
  151.0],
 [-0.00188201652779104,
  -0.044641636506989,
  -0.0514740612388061,
  -0.0263278347173518,
  -0.00844872411121698,
  -0.019163339748222,
  0.0744115640787594,
  -0.0394933828740919,
  -0.0683297436244215,
  -0.09220404962683,
  75.0]]

In [2]:
## Separate the input data from the output
diabetes_x = [x[0:-1] for x in diabetes]
diabetes_y = [x[-1] for x in diabetes]

In [3]:
## Split into training and test sets
diabetes_x_train = diabetes_x[0:370]
diabetes_y_train = diabetes_y[0:370]
diabetes_x_test  = diabetes_x[370:]
diabetes_y_test  = diabetes_y[370:]
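
The split above takes the first 370 rows for training and the remaining 72 for testing. If the row order of the file were not arbitrary, a shuffled split might be preferable. The sketch below shows one way to do that with the standard library only; the function `shuffled_split`, the seed value, and the toy data are illustrative assumptions, not part of the original notebook.

```python
import random


def shuffled_split(x, y, n_train, seed=0):
    """Return (x_train, y_train, x_test, y_test) after shuffling row indices."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    x_train = [x[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    x_test = [x[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]
    return x_train, y_train, x_test, y_test


# Toy data standing in for diabetes_x and diabetes_y.
toy_x = [[float(i)] for i in range(10)]
toy_y = [2.0 * i for i in range(10)]

x_train, y_train, x_test, y_test = shuffled_split(toy_x, toy_y, n_train=7)
print(len(x_train), len(x_test))
```

Each input row stays paired with its output value because the same shuffled indices are used for both lists.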

In [4]:
## Fit the model
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(diabetes_x_train, diabetes_y_train)



LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
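
`LinearRegression` fits the model by ordinary least squares. As a reminder, with $p = 10$ explanatory variables and the $n = 370$ training rows used here, the fitted coefficients minimize the sum of squared residuals. The symbols $\beta_j$ and $x_{ij}$ are notation introduced here for illustration; they do not appear in the original notebook.

```latex
\hat{\beta} = \arg\min_{\beta_0, \ldots, \beta_p}
  \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2
```

After fitting, `regr.intercept_` holds $\hat{\beta}_0$ and `regr.coef_` holds $\hat{\beta}_1, \ldots, \hat{\beta}_p$.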

In [5]:
## Prediction
diabetes_y_pred = regr.predict(diabetes_x_test)
diabetes_y_pred

array([ 59.69047117, 191.93382655, 108.15688272, 143.95176666,
       123.85807651, 177.9931898 , 210.09825351, 163.62450546,
       160.55295745, 141.03304378, 176.70822194,  70.48546737,
       251.08339846, 115.44550989, 108.65311778, 139.40541104,
       113.43509423,  94.77375882, 159.22272638,  73.11120928,
       255.5294549 ,  57.22038646, 102.34715027, 101.76633061,
       259.08489675, 165.92925634,  62.84534834, 185.96500119,
       168.64021247, 189.24837812, 186.53204806,  94.71948322,
       151.27241082, 253.5543533 , 200.57642677, 279.92133129,
        46.80914895, 179.26792936, 204.57619804, 168.00580261,
       152.12336002, 155.5243837 , 239.70877199, 125.17211055,
       163.49352153, 172.84495053, 229.11423174, 154.22282782,
       102.09394021,  85.85752276, 146.47709845, 189.72113642,
       194.12952408, 146.76076634, 171.32188923, 111.37004453,
       164.39270378, 132.94833105, 262.66099948, 100.22920651,
       116.41780149, 123.56593832, 224.32202069,  62.97

In [6]:
from sklearn.metrics import mean_squared_error, r2_score

## Coefficients
print('Coefficients: \n', regr.coef_)

## MSE
print("Mean squared error: %.2f"
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))

## R2
print('Variance score: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))

Coefficients: 
 [  16.86461017 -259.84025933  525.998963    302.24138762 -652.73606616
  367.19436984   31.01925618  139.02402832  666.95531762  126.15988894]
Mean squared error: 2484.78
Variance score: 0.57
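
The two metrics printed above can be reproduced by hand. The following is a minimal sketch in plain Python that uses small made-up vectors rather than the diabetes data; the helper names `mse` and `r2` are illustrative and are not scikit-learn functions.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up observations and predictions, for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 8.0]

print("MSE: %.4f" % mse(y_true, y_pred))
print("R2:  %.4f" % r2(y_true, y_pred))
```

Because the R2 baseline is the mean of the observed values in the evaluated set, the score can be negative on test data when the model predicts worse than that mean.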


In [7]:
## If the prediction were perfect, the points
## would lie on the 45-degree line.
import matplotlib.pyplot as plt
plt.scatter(diabetes_y_test, diabetes_y_pred,  color='black')
plt.plot([min(diabetes_y_test), max(diabetes_y_test)], 
         [min(diabetes_y_test), max(diabetes_y_test)])

[<matplotlib.lines.Line2D at 0x10837f630>]

In [8]:
## Regression using statsmodels with R-style formulas
import statsmodels.formula.api as smf
import pandas as pd


## Read the data into a dataframe
diabetes_df = pd.read_csv('data/diabetes.csv')
diabetes_df.head()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,Y
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019908,-0.017646,151.0
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.06833,-0.092204,75.0
2,0.085299,0.05068,0.044451,-0.005671,-0.045599,-0.034194,-0.032356,-0.002592,0.002864,-0.02593,141.0
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022692,-0.009362,206.0
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031991,-0.046641,135.0


In [9]:
## Split the data into training and test sets
diabetes_df_train = diabetes_df.iloc[0:370,]
diabetes_df_test  = diabetes_df.iloc[370:,]

In [10]:
diabetes_df_train.head()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,Y
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019908,-0.017646,151.0
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.06833,-0.092204,75.0
2,0.085299,0.05068,0.044451,-0.005671,-0.045599,-0.034194,-0.032356,-0.002592,0.002864,-0.02593,141.0
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022692,-0.009362,206.0
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031991,-0.046641,135.0


In [11]:
diabetes_df_test.head()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,Y
370,0.019913,-0.044642,-0.057941,-0.057314,-0.001569,-0.012587,0.074412,-0.039493,-0.061177,-0.075636,63.0
371,0.052606,0.05068,-0.009439,0.049415,0.050717,-0.019163,-0.013948,0.034309,0.119344,-0.017646,197.0
372,-0.02731,0.05068,-0.023451,-0.015999,0.013567,0.012778,0.02655,-0.002592,-0.010904,-0.021788,71.0
373,-0.074533,-0.044642,-0.010517,-0.005671,-0.066239,-0.057054,-0.002903,-0.039493,-0.042572,-0.001078,168.0
374,-0.107226,-0.044642,-0.034229,-0.067642,-0.063487,-0.07052,0.008142,-0.039493,-0.000609,-0.079778,140.0


In [12]:
## Fit the model
mod = smf.ols(formula='Y ~ age+sex+bmi+bp+s1+s2+s3+s4+s5+s6', data=diabetes_df_train)
res = mod.fit()
res.summary()

0,1,2,3
Dep. Variable:,Y,R-squared:,0.505
Model:,OLS,Adj. R-squared:,0.492
Method:,Least Squares,F-statistic:,36.67
Date:,"Tue, 13 Mar 2018",Prob (F-statistic):,4.16e-49
Time:,21:42:19,Log-Likelihood:,-2002.7
No. Observations:,370,AIC:,4027.0
Df Residuals:,359,BIC:,4071.0
Df Model:,10,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,152.6788,2.865,53.285,0.000,147.044,158.314
age,16.8646,65.360,0.258,0.797,-111.671,145.401
sex,-259.8403,68.076,-3.817,0.000,-393.719,-125.962
bmi,525.9990,74.584,7.052,0.000,379.323,672.675
bp,302.2414,72.980,4.141,0.000,158.719,445.764
s1,-652.7361,471.598,-1.384,0.167,-1580.178,274.706
s2,367.1944,385.973,0.951,0.342,-391.858,1126.247
s3,31.0193,240.536,0.129,0.897,-442.018,504.057
s4,139.0240,177.161,0.785,0.433,-209.380,487.428

0,1,2,3
Omnibus:,2.477,Durbin-Watson:,1.983
Prob(Omnibus):,0.29,Jarque-Bera (JB):,2.037
Skew:,0.042,Prob(JB):,0.361
Kurtosis:,2.646,Cond. No.,232.0


In [13]:
res.params

Intercept    152.678798
age           16.864610
sex         -259.840259
bmi          525.998963
bp           302.241388
s1          -652.736066
s2           367.194370
s3            31.019256
s4           139.024028
s5           666.955318
s6           126.159889
dtype: float64

In [14]:
## Out-of-sample prediction
diabetes_df_y_pred = res.predict(diabetes_df_test)
diabetes_df_y_pred

370     59.690471
371    191.933827
372    108.156883
373    143.951767
374    123.858077
375    177.993190
376    210.098254
377    163.624505
378    160.552957
379    141.033044
380    176.708222
381     70.485467
382    251.083398
383    115.445510
384    108.653118
385    139.405411
386    113.435094
387     94.773759
388    159.222726
389     73.111209
390    255.529455
391     57.220386
392    102.347150
393    101.766331
394    259.084897
395    165.929256
396     62.845348
397    185.965001
398    168.640212
399    189.248378
          ...    
412    239.708772
413    125.172111
414    163.493522
415    172.844951
416    229.114232
417    154.222828
418    102.093940
419     85.857523
420    146.477098
421    189.721136
422    194.129524
423    146.760766
424    171.321889
425    111.370045
426    164.392704
427    132.948331
428    262.660999
429    100.229207
430    116.417801
431    123.565938
432    224.322021
433     62.974631
434    137.439188
435    120.817187
436     53

# Exercise.-- Predicting medical expenses

### Real-world problem definition

An insurance company wants to forecast the medical expenses of its insured population so that it can collect more in revenue than it pays out and thereby turn a profit. Costs are hard to forecast for two reasons: the most expensive conditions are the rarest and appear random, and certain conditions are more likely in certain segments of the population (heart attacks in obese people and cancer in smokers).

### Problem definition in terms of the data

The goal is to use a database of 1338 records of hypothetical medical expenses for US patients to estimate the costs for given segments of the population. The recorded information is as follows:

* age: integer, up to 64.

* sex: male, female.

* bmi: body mass index.

* children: integer giving the number of children/dependents covered by the health plan.

* smoker: yes, no.

* region: northeast, southeast, southwest, northwest.

* charges: costs.


In [15]:
import pandas as pd
insurance = pd.read_csv('data/insurance.csv')
insurance.describe()

Unnamed: 0,age,bmi,children,charges
count,1338.0,1338.0,1338.0,1338.0
mean,39.207025,30.663397,1.094918,13270.422265
std,14.04996,6.098187,1.205493,12110.011237
min,18.0,15.96,0.0,1121.8739
25%,27.0,26.29625,0.0,4740.28715
50%,39.0,30.4,1.0,9382.033
75%,51.0,34.69375,2.0,16639.912515
max,64.0,53.13,5.0,63770.42801
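
`describe()` summarizes only the numeric columns by default, so `sex`, `smoker`, and `region` do not appear above. To use them in a regression they would first need to be turned into numbers. The sketch below shows one common approach, dummy (indicator) coding with the first level dropped, in plain Python; the function `dummy_code` and the sample values are made up for illustration.

```python
def dummy_code(values):
    """Map categorical values to 0/1 indicator columns, dropping the first level.

    The first level (in sorted order) becomes the baseline and is represented
    by a row of zeros.
    """
    levels = sorted(set(values))
    kept = levels[1:]  # drop the baseline level to avoid perfect collinearity
    rows = [[1 if v == level else 0 for level in kept] for v in values]
    return kept, rows


# Made-up sample of the `region` column.
region = ["southwest", "southeast", "northwest", "northeast", "southeast"]

columns, encoded = dummy_code(region)
print(columns)
for row in encoded:
    print(row)
```

With four regions this yields three indicator columns; the dropped level is absorbed by the intercept of the regression.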


# Exercise.-- Predicting web page views.

### Real-world problem definition

The goal in this problem is to find a model for forecasting web page views.

### Problem definition in terms of the data

The file `data/top_1000_sites.tsv` contains the number of page views for the top 1000 internet sites. The file is tab-delimited and contains five columns of numeric information: Rank, PageViews, UniqueVisitors, HasAdvertising, and IsEnglish. The goal is to forecast the PageViews column.

### Requirements

For this problem you must build a regression model. Keep the following in mind:


* It may be appropriate to apply a transformation to the variables in the model, both the dependent and the independent ones (natural logarithm, base-10 logarithm, Box-Cox, ...).


* Not all of the variables are necessarily explanatory.
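
As a hedged illustration of the transformation point, the sketch below fits a straight line after taking base-10 logarithms of both variables. The data are synthetic, generated from an assumed power law rather than read from `data/top_1000_sites.tsv`, and `fit_line` is an illustrative helper.

```python
import math


def fit_line(x, y):
    """Least-squares slope and intercept for a single explanatory variable."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic data following y = 50 * x**1.5 (an assumed power law).
x = [10.0, 100.0, 1000.0, 10000.0]
y = [50.0 * v ** 1.5 for v in x]

# On the log-log scale the relationship becomes linear:
# log10(y) = log10(50) + 1.5 * log10(x)
log_x = [math.log10(v) for v in x]
log_y = [math.log10(v) for v in y]

slope, intercept = fit_line(log_x, log_y)
print("slope: %.3f, intercept: %.3f" % (slope, intercept))
```

Whether a logarithm, Box-Cox, or no transformation suits the real data is something to check against plots of the variables and the residuals.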


# Exercise.-- Predicting used vehicle prices.

### Real-world problem definition

A model must be built to forecast the sale price of used cars from their characteristics. The system will be used to set policies for buying and selling vehicles.

### Problem definition in terms of the data

The file `data/cars.tsv` contains 804 records of sale prices and characteristics for different types of cars. The goal is to forecast the `price` column.
