Model Performance Analysis
===

**Juan David Velásquez Henao**  
jdvelasq@unal.edu.co   
Universidad Nacional de Colombia, Sede Medellín  
Facultad de Minas  
Medellín, Colombia

---

Click [here](https://github.com/jdvelasq/IPython-for-predictive-analytics/blob/master/05-model-performance.ipynb) to access the latest version online.

Click [here](http://nbviewer.jupyter.org/github/jdvelasq/IPython-for-predictive-analytics/blob/master/05-model-performance.ipynb) to view the latest version online in `nbviewer`.

---
[License](https://github.com/jdvelasq/IPython-for-predictive-analytics/blob/master/LICENSE)  
[Readme](https://github.com/jdvelasq/IPython-for-predictive-analytics/blob/master/readme.md)

# Data partitioning

### train_test_split

In [1]:
import numpy as np
from sklearn.model_selection import train_test_split

##
## Create the data
##
X, y = np.arange(10).reshape((5, 2)), range(5)
X

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])

In [2]:
list(y)

[0, 1, 2, 3, 4]

In [8]:
##
## Split the data
##
X_train, X_test, y_train, y_test = train_test_split(
    X, y,             # original data
    test_size=2,      # float (fraction) or int (row count): test set size
    random_state=42)  # seed of the random number generator

In [8]:
##
## Training sample
##
X_train

array([[4, 5],
       [0, 1],
       [6, 7]])

In [9]:
y_train

[2, 0, 3]

In [10]:
##
## Test sample
##
X_test

array([[2, 3],
       [8, 9]])

In [11]:
y_test

[1, 4]
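
The comment on `test_size` notes that it accepts either a float or an int. As a minimal sketch with a made-up ten-row dataset (the arrays and the 40% fraction are assumptions chosen for illustration), a float is read as a fraction of the rows, and the optional `stratify` argument asks `train_test_split` to keep the class proportions of `y` in both subsets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

##
## Illustrative data: 10 rows, two balanced classes (assumed values)
##
X = np.arange(20).reshape((10, 2))
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.4,     # fraction: 40% of the 10 rows go to the test set
    stratify=y,        # keep the class proportions in both subsets
    random_state=42)   # arbitrary seed

print(X_train.shape, X_test.shape)                 # (6, 2) (4, 2)
print(np.bincount(y_train), np.bincount(y_test))   # [3 3] [2 2]
```

Which rows end up in each subset depends on the seed; only the sizes and class counts are fixed by the arguments.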

### KFold

In [15]:
import numpy as np
from sklearn.model_selection import KFold

X = ["a", "b", "c", "d", "e", "f", "g", "h"]
kf = KFold(n_splits=4)
for train, test in kf.split(X):
    print("%s %s" % (train, test))

[2 3 4 5 6 7] [0 1]
[0 1 4 5 6 7] [2 3]
[0 1 2 3 6 7] [4 5]
[0 1 2 3 4 5] [6 7]
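
By default `KFold` assigns consecutive indices to each test fold, as the output above shows. The following is a minimal sketch, assuming the same eight-element list, of shuffling the indices before splitting with `shuffle=True` (the seed value is an arbitrary choice). Every index still lands in exactly one test fold:

```python
from sklearn.model_selection import KFold

X = ["a", "b", "c", "d", "e", "f", "g", "h"]

##
## shuffle=True permutes the indices before forming the folds
##
kf = KFold(n_splits=4, shuffle=True, random_state=0)

seen = []
for train, test in kf.split(X):
    print("%s %s" % (train, test))
    seen.extend(test.tolist())

##
## Each of the 8 indices is used for testing exactly once
##
print(sorted(seen))  # [0, 1, 2, 3, 4, 5, 6, 7]
```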


### Repeated KFold

In [18]:
from sklearn.model_selection import RepeatedKFold
##
## Sample data
##
X = np.array([[ 1,  2], 
              [ 3,  4], 
              [ 5,  6], 
              [ 7,  8],
              [ 9, 10],
              [11, 12],
              [13, 14],
              [15, 16]])

##
## Repeat K-Fold n times
##
rkf = RepeatedKFold(n_splits=4, 
                    n_repeats=2, 
                    random_state=123)

for train, test in rkf.split(X):
    print("%s %s" % (train, test))

[2 3 4 5 6 7] [0 1]
[0 1 2 4 5 6] [3 7]
[0 1 3 5 6 7] [2 4]
[0 1 2 3 4 7] [5 6]
[1 2 3 4 6 7] [0 5]
[0 1 2 3 5 7] [4 6]
[0 2 3 4 5 6] [1 7]
[0 1 4 5 6 7] [2 3]
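
The eight lines of output correspond to 4 folds times 2 repetitions. As a small sketch (the array below is a stand-in with the same shape as the data above), `get_n_splits` reports that total, and each repetition on its own is a complete K-Fold partition:

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.arange(16).reshape((8, 2))  # stand-in data with 8 rows

rkf = RepeatedKFold(n_splits=4, n_repeats=2, random_state=123)

n_total = rkf.get_n_splits(X)
print(n_total)  # 8 = 4 folds x 2 repetitions

##
## The first 4 splits form repetition 1, the last 4 repetition 2;
## within each repetition every index is tested exactly once
##
splits = list(rkf.split(X))
rep1 = sorted(i for _, test in splits[:4] for i in test.tolist())
rep2 = sorted(i for _, test in splits[4:] for i in test.tolist())
print(rep1 == rep2 == list(range(8)))  # True
```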


### Leave-One-Out (LOO)

In [19]:
from sklearn.model_selection import LeaveOneOut
##
## Sample data
##
X = np.array([[ 1,  2], 
              [ 3,  4], 
              [ 5,  6], 
              [ 7,  8],
              [ 9, 10],
              [11, 12],
              [13, 14],
              [15, 16]])

loo = LeaveOneOut()

for train, test in loo.split(X):
    print("%s %s" % (train, test))

[1 2 3 4 5 6 7] [0]
[0 2 3 4 5 6 7] [1]
[0 1 3 4 5 6 7] [2]
[0 1 2 4 5 6 7] [3]
[0 1 2 3 5 6 7] [4]
[0 1 2 3 4 6 7] [5]
[0 1 2 3 4 5 7] [6]
[0 1 2 3 4 5 6] [7]
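
Leave-One-Out can be read as the limiting case of K-Fold in which the number of folds equals the number of samples. A short sketch that checks this on a stand-in array with eight rows (the comparison helper `as_lists` is a hypothetical name introduced here):

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(16).reshape((8, 2))  # stand-in data with 8 rows


def as_lists(splitter, data):
    """Materialize the (train, test) index pairs as plain lists."""
    return [(tr.tolist(), te.tolist()) for tr, te in splitter.split(data)]


loo_splits = as_lists(LeaveOneOut(), X)
kf_splits = as_lists(KFold(n_splits=len(X)), X)  # one fold per sample

print(loo_splits == kf_splits)  # True
```

Because one model is fitted per sample, the cost of LOO grows linearly with the size of the dataset.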


### Leave-P-Out (LPO)

In [20]:
from sklearn.model_selection import LeavePOut
##
## Sample data
##
X = np.array([[ 1,  2], 
              [ 3,  4], 
              [ 5,  6], 
              [ 7,  8],
              [ 9, 10],
              [11, 12],
              [13, 14],
              [15, 16]])

lpo = LeavePOut(p=3)

for train, test in lpo.split(X):
    print("%s %s" % (train, test))

[3 4 5 6 7] [0 1 2]
[2 4 5 6 7] [0 1 3]
[2 3 5 6 7] [0 1 4]
[2 3 4 6 7] [0 1 5]
[2 3 4 5 7] [0 1 6]
[2 3 4 5 6] [0 1 7]
[1 4 5 6 7] [0 2 3]
[1 3 5 6 7] [0 2 4]
[1 3 4 6 7] [0 2 5]
[1 3 4 5 7] [0 2 6]
[1 3 4 5 6] [0 2 7]
[1 2 5 6 7] [0 3 4]
[1 2 4 6 7] [0 3 5]
[1 2 4 5 7] [0 3 6]
[1 2 4 5 6] [0 3 7]
[1 2 3 6 7] [0 4 5]
[1 2 3 5 7] [0 4 6]
[1 2 3 5 6] [0 4 7]
[1 2 3 4 7] [0 5 6]
[1 2 3 4 6] [0 5 7]
[1 2 3 4 5] [0 6 7]
[0 4 5 6 7] [1 2 3]
[0 3 5 6 7] [1 2 4]
[0 3 4 6 7] [1 2 5]
[0 3 4 5 7] [1 2 6]
[0 3 4 5 6] [1 2 7]
[0 2 5 6 7] [1 3 4]
[0 2 4 6 7] [1 3 5]
[0 2 4 5 7] [1 3 6]
[0 2 4 5 6] [1 3 7]
[0 2 3 6 7] [1 4 5]
[0 2 3 5 7] [1 4 6]
[0 2 3 5 6] [1 4 7]
[0 2 3 4 7] [1 5 6]
[0 2 3 4 6] [1 5 7]
[0 2 3 4 5] [1 6 7]
[0 1 5 6 7] [2 3 4]
[0 1 4 6 7] [2 3 5]
[0 1 4 5 7] [2 3 6]
[0 1 4 5 6] [2 3 7]
[0 1 3 6 7] [2 4 5]
[0 1 3 5 7] [2 4 6]
[0 1 3 5 6] [2 4 7]
[0 1 3 4 7] [2 5 6]
[0 1 3 4 6] [2 5 7]
[0 1 3 4 5] [2 6 7]
[0 1 2 6 7] [3 4 5]
[0 1 2 5 7] [3 4 6]
[0 1 2 5 6] [3 4 7]
[0 1 2 4 7] [3 5 6]
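
Leave-P-Out enumerates every possible test set of size `p`, so the number of splits is the binomial coefficient "n choose p". A brief sketch that compares `get_n_splits` with `math.comb` (the array is a stand-in with eight rows; `math.comb` requires Python 3.8 or later):

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

X = np.arange(16).reshape((8, 2))  # stand-in data with 8 rows

lpo = LeavePOut(p=3)
n_lpo = lpo.get_n_splits(X)

print(n_lpo, comb(len(X), 3))  # 56 56

##
## The count grows combinatorially with the sample size
##
print(comb(100, 3))  # 161700
```

For this reason LPO is usually practical only for small datasets or small values of `p`.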


### ShuffleSplit

In [21]:
from sklearn.model_selection import ShuffleSplit

X = np.array([[ 1,  2], 
              [ 3,  4], 
              [ 5,  6], 
              [ 7,  8],
              [ 9, 10],
              [11, 12],
              [13, 14],
              [15, 16]])

ss = ShuffleSplit(n_splits=3, 
                  test_size=0.25,
                  random_state=0)

for train_index, test_index in ss.split(X):
    print("%s %s" % (train_index, test_index))

[1 7 3 0 5 4] [6 2]
[3 7 0 4 2 5] [1 6]
[3 4 7 0 6 1] [5 2]
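
Unlike K-Fold, `ShuffleSplit` draws each split independently, so the same index can appear in several test sets (indices 2 and 6 each appear twice in the output above). A minimal sketch, assuming ten splits and a stand-in array, that counts how often each index is used for testing:

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(16).reshape((8, 2))  # stand-in data with 8 rows

ss = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)

counts = Counter()
sizes = set()
for train_index, test_index in ss.split(X):
    # within one split, train and test never share an index
    assert set(train_index).isdisjoint(test_index)
    sizes.add((len(train_index), len(test_index)))
    counts.update(test_index.tolist())

print(sizes)                   # {(6, 2)}
print(sorted(counts.items()))  # times each index was in a test set
```

The per-index counts are generally uneven, which is the trade-off for being able to choose the number of splits and the test size independently.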


### Stratified k-fold

It is used in classification problems so that the class proportions in the training and test groups are similar to those in the original sample.

In [22]:
from sklearn.model_selection import StratifiedKFold

X = np.ones(10)
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

skf = StratifiedKFold(n_splits=3)

for train, test in skf.split(X, y):
    print("%s %s" % (train, test))

[2 3 6 7 8 9] [0 1 4 5]
[0 1 3 4 5 8 9] [2 6 7]
[0 1 2 4 5 6 7] [3 8 9]
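
To make the stratification visible, the class counts of each test fold can be tabulated. This is a small sketch on the same labels (4 samples of class 0 and 6 of class 1); the exact indices assigned to each fold may vary between scikit-learn versions, so only the counts are inspected:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.ones(10)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

skf = StratifiedKFold(n_splits=3)

fold_counts = []
for train, test in skf.split(X, y):
    # number of samples of class 0 and class 1 in this test fold
    counts = np.bincount(y[test], minlength=2)
    fold_counts.append(counts)
    print("test fold class counts:", counts)

##
## Over all folds, every sample is tested once: 4 zeros and 6 ones
##
print(np.sum(fold_counts, axis=0))  # [4 6]
```

Every test fold contains both classes, which a plain `KFold` on these ordered labels would not guarantee.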


# Model hyperparameter tuning

In [23]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC

##
## Load the dataset
##
digits = datasets.load_digits()

##
## Flatten each image into a feature vector and extract the targets
## 
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target

##
## Split the data into two equal-sized sets
##
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.5, 
    random_state=0)

##
## An SVM is used here. The parameters that can be tuned
## depend on the kernel type.
##
## tuned_parameters is a list of dictionaries, one per kernel,
## holding the candidate values for each parameter
##
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

##
## Define the scoring metrics to use
##
scores = ['precision', 'recall']


for score in scores:

    ##
    ## Build the grid search around the classifier
    ##
    clf = GridSearchCV(SVC(), 
                       tuned_parameters, 
                       cv=5,
                       scoring='%s_macro' % score)
    
    ##
    ## Training
    ##
    clf.fit(X_train, y_train)

    ##
    ## clf.best_params_ holds the best parameter combination found
    ## clf.cv_results_ stores the results of the search
    ##
    stds = clf.cv_results_['std_test_score']
    
    ##
    ## True and predicted values
    ##
    y_true, y_pred = y_test, clf.predict(X_test)
    print(' ')
    print(classification_report(y_true, y_pred))



 
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        89
          1       0.97      1.00      0.98        90
          2       0.99      0.98      0.98        92
          3       1.00      0.99      0.99        93
          4       1.00      1.00      1.00        76
          5       0.99      0.98      0.99       108
          6       0.99      1.00      0.99        89
          7       0.99      1.00      0.99        78
          8       1.00      0.98      0.99        92
          9       0.99      0.99      0.99        92

avg / total       0.99      0.99      0.99       899

 
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        89
          1       0.97      1.00      0.98        90
          2       0.99      0.98      0.98        92
          3       1.00      0.99      0.99        93
          4       1.00      1.00      1.00        76
          5       0.99      0.98      
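
The loop above reads `clf.cv_results_['std_test_score']` but does not display it. As a hedged sketch of how the search results might be inspected, the following fits a reduced grid (an assumption made only to keep the example fast; `param_grid` is a name introduced here) and prints the best combination together with the mean and standard deviation of the cross-validated score for each candidate:

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

digits = datasets.load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

##
## Reduced grid (assumption): 2 values of gamma x 2 values of C
##
param_grid = {'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10]}

clf = GridSearchCV(SVC(), param_grid, cv=5, scoring='precision_macro')
clf.fit(X_train, y_train)

print("Best parameters:", clf.best_params_)
print("Best CV score: %0.3f" % clf.best_score_)

##
## One row per candidate: mean and std of the score over the 5 folds
##
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (std %0.3f) for %r" % (mean, std, params))
```

`best_score_` is the cross-validated score on the training half only; the classification report on `X_test` remains the estimate of performance on unseen data.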
