# Machine learning exercise: wine classification and regression

In this exercise (much less guided than the previous ones) you have two objectives. You will work with a dataset of wines and their characteristics (such as acidity, density...). You will have to build, train, validate, and test both classification and regression models.

The dataset comes from the University of Minho and was generated by [P. Cortez](http://www3.dsi.uminho.pt/pcortez/Home.html) et al. It is hosted in the [*UC Irvine Machine Learning Repository*](https://archive.ics.uci.edu/ml/index.html) (it is available [here](https://archive.ics.uci.edu/ml/datasets/Wine+Quality), but you must use the version included in the same folder as this document). The dataset description follows:

```
Citation Request:
  This dataset is public available for research. The details are described in [Cortez et al., 2009]. 
  Please include this citation if you plan to use this database:

  P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. 
  Modeling wine preferences by data mining from physicochemical properties.
  In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

  Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
                [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
                [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

1. Title: Wine Quality 

2. Sources
   Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009
   
3. Past Usage:

  P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. 
  Modeling wine preferences by data mining from physicochemical properties.
  In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

  In the above reference, two datasets were created, using red and white wine samples.
  The inputs include objective tests (e.g. PH values) and the output is based on sensory data
  (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality 
  between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model
  these datasets under a regression approach. The support vector machine model achieved the
  best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T),
  etc. Also, we plot the relative importances of the input variables (as measured by a sensitivity
  analysis procedure).
 
4. Relevant Information:

   The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine.
   For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009].
   Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables 
   are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

   These datasets can be viewed as classification or regression tasks.
   The classes are ordered and not balanced (e.g. there are munch more normal wines than
   excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent
   or poor wines. Also, we are not sure if all input variables are relevant. So
   it could be interesting to test feature selection methods. 

5. Number of Instances: red wine - 1599; white wine - 4898. 

6. Number of Attributes: 11 + output attribute
  
   Note: several of the attributes may be correlated, thus it makes sense to apply some sort of
   feature selection.

7. Attribute information:

   For more information, read [Cortez et al., 2009].

   Input variables (based on physicochemical tests):
   1 - fixed acidity
   2 - volatile acidity
   3 - citric acid
   4 - residual sugar
   5 - chlorides
   6 - free sulfur dioxide
   7 - total sulfur dioxide
   8 - density
   9 - pH
   10 - sulphates
   11 - alcohol
   Output variable (based on sensory data): 
   12 - quality (score between 0 and 10)

8. Missing Attribute Values: None
```

Besides the 12 variables described above, the dataset you will use has one more: whether the wine is white or red. With that said, the objectives are:

1. Split the dataset into training (+ validation if you are not going to use cross-validation) and testing, applying beforehand (or afterwards) whatever data transformations you see fit, as well as variable selection, dimensionality reduction... You may also decide to use the data as is...
2. Build a model that classifies as well as possible whether a wine is white or red from the remaining variables (you will see that getting a very good result is dead easy).
3. Build a regression model that predicts wine quality as well as possible.

The CSV file to use, `winequality.csv`, has a header row naming each variable, and the values are separated by semicolons.

Feel free to do as much exploratory and statistical analysis (and as many plots) as you like before jumping into modeling.

And that's all. Good luck!

# 1. Preparing the dataset

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [3]:
crudo=pd.read_csv('winequality.csv',sep=";")
print(type(crudo))
crudo.head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,color
0,5.2,0.34,0.0,1.8,0.05,27.0,63.0,0.9916,3.68,0.79,14.0,6,red
1,6.2,0.55,0.45,12.0,0.049,27.0,186.0,0.9974,3.17,0.5,9.3,6,white
2,7.15,0.17,0.24,9.6,0.119,56.0,178.0,0.99578,3.15,0.44,10.2,6,white
3,6.7,0.64,0.23,2.1,0.08,11.0,119.0,0.99538,3.36,0.7,10.9,5,red
4,7.6,0.23,0.34,1.6,0.043,24.0,129.0,0.99305,3.12,0.7,10.4,5,white


The `color` variable (the label), which is used to classify the wines into two categories, white and red, is categorical. It is first converted into a numeric variable.
I define a function that converts "red" to 1 and "white" to 0:

In [4]:
def renombrar(elemento):
    if elemento =="red":
        return 1
    elif elemento=="white":
        return 0

I apply this function to the `color` column with the `apply` method:

In [8]:
crudo["color"]=np.array(crudo["color"].apply(renombrar))
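
The same encoding can also be sketched with `Series.map` and a dictionary, which avoids writing a helper function. The toy column and the name `COLOR_CODES` below are assumptions for illustration; they are not part of the notebook.

```python
import pandas as pd

# Toy column standing in for the real "color" column (illustrative values only).
colors = pd.Series(["red", "white", "white", "red"], name="color")

# Hypothetical mapping name; same encoding as renombrar: red -> 1, white -> 0.
COLOR_CODES = {"red": 1, "white": 0}

encoded = colors.map(COLOR_CODES)

# Any label missing from the mapping becomes NaN, so a quick check is worthwhile.
assert encoded.notna().all()
print(encoded.tolist())
```

A label that is neither "red" nor "white" would show up as a missing value in both approaches, so checking for missing values after the conversion is a cheap safeguard.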

In [9]:
crudo.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,color
0,5.2,0.34,0.0,1.8,0.05,27.0,63.0,0.9916,3.68,0.79,14.0,6,1
1,6.2,0.55,0.45,12.0,0.049,27.0,186.0,0.9974,3.17,0.5,9.3,6,0
2,7.15,0.17,0.24,9.6,0.119,56.0,178.0,0.99578,3.15,0.44,10.2,6,0
3,6.7,0.64,0.23,2.1,0.08,11.0,119.0,0.99538,3.36,0.7,10.9,5,1
4,7.6,0.23,0.34,1.6,0.043,24.0,129.0,0.99305,3.12,0.7,10.4,5,0


The features (everything except the label) need to be standardized before they are used in some of the classification models. To do so, the `color` column (the label) is removed with the `pop` method, the other 12 variables (the features) are standardized, and the `color` column is then added back.

In [11]:
color=crudo.pop("color")

In [12]:
crudo.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,5.2,0.34,0.0,1.8,0.05,27.0,63.0,0.9916,3.68,0.79,14.0,6
1,6.2,0.55,0.45,12.0,0.049,27.0,186.0,0.9974,3.17,0.5,9.3,6
2,7.15,0.17,0.24,9.6,0.119,56.0,178.0,0.99578,3.15,0.44,10.2,6
3,6.7,0.64,0.23,2.1,0.08,11.0,119.0,0.99538,3.36,0.7,10.9,5
4,7.6,0.23,0.34,1.6,0.043,24.0,129.0,0.99305,3.12,0.7,10.4,5


Now the remaining variables (the features) are standardized: each value has its column mean subtracted and is then divided by its column standard deviation, so that every feature ends up with mean zero and standard deviation one.
To do this, an instance of the `StandardScaler` class is created:

In [13]:
escalador=StandardScaler()
escalador.fit(crudo)
crudoesc=escalador.transform(crudo)
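
As a small worked check of what the scaler computes, the sketch below standardizes a toy matrix by hand and compares the result with `StandardScaler`. The matrix values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (two columns, illustrative values only).
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

# Manual z-score: subtract the column mean, divide by the column standard deviation.
# np.std defaults to ddof=0 (population standard deviation), which StandardScaler also uses.
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)

z_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(z_manual, z_sklearn))
```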

In [18]:
crudo.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,5.2,0.34,0.0,1.8,0.05,27.0,63.0,0.9916,3.68,0.79,14.0,6
1,6.2,0.55,0.45,12.0,0.049,27.0,186.0,0.9974,3.17,0.5,9.3,6
2,7.15,0.17,0.24,9.6,0.119,56.0,178.0,0.99578,3.15,0.44,10.2,6
3,6.7,0.64,0.23,2.1,0.08,11.0,119.0,0.99538,3.36,0.7,10.9,5
4,7.6,0.23,0.34,1.6,0.043,24.0,129.0,0.99305,3.12,0.7,10.4,5


In [19]:
crudoesc.head()

AttributeError: 'numpy.ndarray' object has no attribute 'head'

In [17]:
type(crudoesc)

numpy.ndarray

The data has to be converted back into a DataFrame after standardizing, because `transform` returns a NumPy array:

In [20]:
crudoesc=pd.DataFrame(crudoesc)

In [21]:
crudoesc.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,-1.55462,0.002029,-2.192833,-0.765798,-0.172244,-0.198632,-0.933243,-1.032748,2.870469,1.738854,2.94159,0.207999
1,-0.783214,1.277665,0.904066,1.378213,-0.20079,-0.198632,1.243074,0.90159,-0.301669,-0.210144,-0.999313,0.207999
2,-0.050378,-1.030629,-0.541153,0.87374,1.797445,1.435352,1.101525,0.36131,-0.426067,-0.613385,-0.244672,0.207999
3,-0.397511,1.824366,-0.609973,-0.702739,0.684143,-1.10014,0.0576,0.227907,0.880108,1.133992,0.342271,-0.93723
4,0.296754,-0.666161,0.147046,-0.807837,-0.372068,-0.367664,0.234537,-0.549163,-0.612663,1.133992,-0.076974,-0.93723


I restore the column names:

In [22]:
crudoesc.columns=crudo.columns
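
The conversion back to a DataFrame and the renaming can also be done in a single step. The sketch below uses a toy frame named `raw` (an assumption standing in for `crudo`) and also carries over the row index, which matters when the label column is concatenated back later, because `concat` with `axis=1` aligns rows by index.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for `crudo` (illustrative values only).
raw = pd.DataFrame({"pH": [3.68, 3.17, 3.15], "alcohol": [14.0, 9.3, 10.2]},
                   index=[10, 11, 12])

scaled = pd.DataFrame(
    StandardScaler().fit_transform(raw),
    columns=raw.columns,  # keep the original column names
    index=raw.index,      # keep the original row labels so a later concat aligns
)
print(scaled.columns.tolist(), scaled.index.tolist())
```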

In [23]:
crudoesc.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,-1.55462,0.002029,-2.192833,-0.765798,-0.172244,-0.198632,-0.933243,-1.032748,2.870469,1.738854,2.94159,0.207999
1,-0.783214,1.277665,0.904066,1.378213,-0.20079,-0.198632,1.243074,0.90159,-0.301669,-0.210144,-0.999313,0.207999
2,-0.050378,-1.030629,-0.541153,0.87374,1.797445,1.435352,1.101525,0.36131,-0.426067,-0.613385,-0.244672,0.207999
3,-0.397511,1.824366,-0.609973,-0.702739,0.684143,-1.10014,0.0576,0.227907,0.880108,1.133992,0.342271,-0.93723
4,0.296754,-0.666161,0.147046,-0.807837,-0.372068,-0.367664,0.234537,-0.549163,-0.612663,1.133992,-0.076974,-0.93723


Finally, I add the `color` column (the label) back.
To do that I import `concat`:

In [15]:
from pandas import concat

In [24]:
vinoesc=concat([crudoesc,color],axis=1)

In [26]:
vinoesc.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,color
0,-1.55462,0.002029,-2.192833,-0.765798,-0.172244,-0.198632,-0.933243,-1.032748,2.870469,1.738854,2.94159,0.207999,1
1,-0.783214,1.277665,0.904066,1.378213,-0.20079,-0.198632,1.243074,0.90159,-0.301669,-0.210144,-0.999313,0.207999,0
2,-0.050378,-1.030629,-0.541153,0.87374,1.797445,1.435352,1.101525,0.36131,-0.426067,-0.613385,-0.244672,0.207999,0
3,-0.397511,1.824366,-0.609973,-0.702739,0.684143,-1.10014,0.0576,0.227907,0.880108,1.133992,0.342271,-0.93723,1
4,0.296754,-0.666161,0.147046,-0.807837,-0.372068,-0.367664,0.234537,-0.549163,-0.612663,1.133992,-0.076974,-0.93723,0


I check that the mean and standard deviation of the twelve features are very close to zero and one, respectively, while the label has not been standardized:

In [27]:
vinoesc.describe()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,color
count,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0
mean,-4.211183e-16,-2.163231e-16,-5.579311e-18,8.373237999999999e-19,7.566673e-17,-1.034629e-16,-5.2597610000000006e-17,-3.581405e-15,2.709238e-15,2.659358e-16,9.520714e-16,-5.872545e-16,0.246114
std,1.000077,1.000077,1.000077,1.000077,1.000077,1.000077,1.000077,1.000077,1.000077,1.000077,1.000077,1.000077,0.430779
min,-2.634589,-1.57733,-2.192833,-1.018034,-1.342639,-1.663583,-1.94178,-2.530192,-3.100615,-2.091935,-2.08935,-3.227687,0.0
25%,-0.6289329,-0.6661613,-0.4723335,-0.7657978,-0.5147986,-0.7620742,-0.6855323,-0.7859527,-0.6748622,-0.6805919,-0.8316152,-0.9372296,0.0
50%,-0.1660892,-0.3016939,-0.05941375,-0.5135612,-0.2578826,-0.08594301,0.03990667,0.06448888,-0.05287424,-0.1429373,-0.1608231,0.207999,0.0
75%,0.3738951,0.3664962,0.4911459,0.5584445,0.2559494,0.5901882,0.7122647,0.7648525,0.6313125,0.4619241,0.677667,0.207999,0.0
max,6.699425,7.534354,9.231281,12.68682,15.84219,14.56357,5.737257,14.76879,4.923029,9.870879,3.696231,3.643685,1.0
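
The standard deviations in the table are 1.000077 and not exactly 1. A likely explanation, sketched below, is that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas `describe()` reports the sample standard deviation (`ddof=1`); with 6497 rows the two differ by a factor of sqrt(n / (n - 1)).

```python
import math

n = 6497  # number of rows reported by describe()

# For a column scaled to population standard deviation 1 (ddof=0),
# the sample standard deviation (ddof=1) equals sqrt(n / (n - 1)).
ratio = math.sqrt(n / (n - 1))
print(round(ratio, 6))
```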


Now the data is split into training data (80%) and test data (20%):

In [28]:
vinoescsep=train_test_split(vinoesc,train_size=0.8, test_size=0.2)
entreno=vinoescsep[0]
testeo=vinoescsep[1]
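
The split above is random and changes on every run. The following is a sketch of a variant, not what the notebook does: it fixes `random_state` so the split is repeatable, stratifies on the label so both parts keep the same class ratio, and fits the scaler on the training rows only so that the test rows do not influence the means and standard deviations. The data is synthetic and all names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine data: 100 rows, 3 features, imbalanced binary label.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = np.array([1] * 25 + [0] * 75)

# random_state makes the split repeatable; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fit the scaler on the training rows only, then apply it to both parts.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.shape, X_test_std.shape, y_test.mean())
```

With a dataset of this size the effect of standardizing before splitting is probably small, but fitting the scaler on the training part keeps the test set untouched until evaluation.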

In [29]:
entreno[:5]

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,color
6123,-0.706073,0.366496,0.009406,1.083937,-0.20079,0.421155,1.650027,1.135045,1.564294,0.394717,-0.999313,0.207999,0
6021,0.219614,0.366496,-1.366993,-0.513561,1.026697,-0.085943,-1.110179,0.668136,0.942306,1.133992,-0.160823,0.207999,1
624,-1.708901,-0.423183,0.559966,-0.891916,0.569958,0.646532,0.146068,-0.886005,0.631312,-0.411765,-0.328521,0.207999,0
5721,-0.474652,-0.483928,-0.747613,-0.534581,-0.857353,0.984598,0.181456,-1.269537,0.009325,-1.016626,0.677667,1.353228,0
822,0.142473,-0.605417,0.284686,-0.723758,-0.714622,-0.198632,0.411473,-1.382929,0.382517,-0.344558,1.683855,3.643685,0


In [30]:
testeo[:5]

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,color
6077,-0.011808,-0.969884,0.628786,-0.891916,-0.229336,0.590188,-0.33166,-0.932696,-0.488266,-0.546178,-0.076974,-0.93723,0
3742,0.296754,-0.848395,-0.403514,-0.198265,0.85542,0.364811,0.305311,0.534733,-0.426067,-0.747799,-1.083162,-0.93723,0
1907,-0.24323,-0.240949,-0.472334,-0.450502,-0.42916,-0.254976,0.146068,-0.13895,-0.239471,-0.210144,-0.999313,0.207999,0
4429,-0.551792,1.763621,0.078226,-0.765798,0.084672,-0.818419,-1.55252,0.204562,0.880108,0.730751,-0.328521,0.207999,1
610,0.836739,-0.848395,1.179346,-0.786818,-0.457706,0.195778,0.942282,-0.299033,-1.048055,-1.083833,-0.328521,-0.93723,0


Next, the correlation matrix of the 13 variables is examined in case any correlation close to 1 or -1 makes the selection of some variable obvious.
First the data is converted from a DataFrame to an array in order to compute the correlation matrix. The `vino` DataFrame used here holds the unstandardized data and is built in section 2.5 (cell `In [78]`), which was run before this cell:

In [116]:
vinoarr=np.array(vino)
print(type(vinoarr))
print(vinoarr[:3])

<class 'numpy.ndarray'>
[[  5.20000000e+00   3.40000000e-01   0.00000000e+00   1.80000000e+00
    5.00000000e-02   2.70000000e+01   6.30000000e+01   9.91600000e-01
    3.68000000e+00   7.90000000e-01   1.40000000e+01   6.00000000e+00
    1.00000000e+00]
 [  6.20000000e+00   5.50000000e-01   4.50000000e-01   1.20000000e+01
    4.90000000e-02   2.70000000e+01   1.86000000e+02   9.97400000e-01
    3.17000000e+00   5.00000000e-01   9.30000000e+00   6.00000000e+00
    0.00000000e+00]
 [  7.15000000e+00   1.70000000e-01   2.40000000e-01   9.60000000e+00
    1.19000000e-01   5.60000000e+01   1.78000000e+02   9.95780000e-01
    3.15000000e+00   4.40000000e-01   1.02000000e+01   6.00000000e+00
    0.00000000e+00]]


In the next step the correlation matrix is computed:

In [117]:
correlacion=np.corrcoef(vinoarr,rowvar=0) # rowvar=0 so that columns, not rows, are treated as variables.

correlaciondf=pd.DataFrame(correlacion) # convert back to a DataFrame.

correlaciondf.columns=vino.columns # restore the variable names.

correlaciondf.index=correlaciondf.columns # label the rows with the same names as the columns.

correlaciondf

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,color
fixed acidity,1.0,0.219008,0.324436,-0.111981,0.298195,-0.282735,-0.329054,0.45891,-0.2527,0.299568,-0.095452,-0.076743,0.48674
volatile acidity,0.219008,1.0,-0.377981,-0.196011,0.377124,-0.352557,-0.414476,0.271296,0.261454,0.225984,-0.03764,-0.265699,0.653036
citric acid,0.324436,-0.377981,1.0,0.142451,0.038998,0.133126,0.195242,0.096154,-0.329808,0.056197,-0.010493,0.085532,-0.187397
residual sugar,-0.111981,-0.196011,0.142451,1.0,-0.12894,0.402871,0.495482,0.552517,-0.26732,-0.185927,-0.359415,-0.03698,-0.348821
chlorides,0.298195,0.377124,0.038998,-0.12894,1.0,-0.195045,-0.27963,0.362615,0.044708,0.395593,-0.256916,-0.200666,0.512678
free sulfur dioxide,-0.282735,-0.352557,0.133126,0.402871,-0.195045,1.0,0.720934,0.025717,-0.145854,-0.188457,-0.179838,0.055463,-0.471644
total sulfur dioxide,-0.329054,-0.414476,0.195242,0.495482,-0.27963,0.720934,1.0,0.032395,-0.238413,-0.275727,-0.26574,-0.041385,-0.700357
density,0.45891,0.271296,0.096154,0.552517,0.362615,0.025717,0.032395,1.0,0.011686,0.259478,-0.686745,-0.305858,0.390645
pH,-0.2527,0.261454,-0.329808,-0.26732,0.044708,-0.145854,-0.238413,0.011686,1.0,0.192123,0.121248,0.019506,0.329129
sulphates,0.299568,0.225984,0.056197,-0.185927,0.395593,-0.188457,-0.275727,0.259478,0.192123,1.0,-0.003029,0.038485,0.487218


This correlation matrix was checked by exporting the data to Excel and computing the matrix there. The results are identical. See the appendix.
No correlation is striking enough (close to 1 or -1) to clearly single out a variable for selection; the largest values in magnitude shown above are 0.72 (free vs. total sulfur dioxide) and -0.70 (total sulfur dioxide vs. color).
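
As an alternative to the NumPy route, pandas can compute the same matrix directly with `DataFrame.corr`, keeping the labels. The sketch below uses a toy frame (an assumption standing in for `vino`) and also ranks the variable pairs by absolute correlation, which makes the strongest relationships easier to spot than reading the full matrix.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Toy frame standing in for `vino` (illustrative values only).
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.1, 9.8],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})

corr = df.corr()  # Pearson correlation between columns; labels are kept

# Same numbers as np.corrcoef with columns treated as variables.
assert np.allclose(corr.to_numpy(), np.corrcoef(df.to_numpy(), rowvar=False))

# Rank every distinct pair of variables by absolute correlation.
pairs = sorted(
    ((x, y, corr.loc[x, y]) for x, y in combinations(corr.columns, 2)),
    key=lambda item: abs(item[2]),
    reverse=True,
)
for x, y, r in pairs:
    print(f"{x} vs {y}: {r:+.3f}")
```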

# 2. Classification
In this section several models are explored informally. In the last subsection, several models will be trained and validated rigorously by cross-validation using scikit-learn's `GridSearchCV` class.

There are not many variables, so no variable selection is performed and all of them are used. One variable that would make it easy to classify a wine as red or white is tannin concentration, but it is not available. In any case, as noted, all available variables are used.

## 2.1. Logistic regression

In [31]:
from sklearn.linear_model import LogisticRegression

I create an instance of the `LogisticRegression` class with default values:

In [32]:
lr=LogisticRegression()

I train the model on the training subset. I include all the variables in the model because there are not many of them:

In [33]:
lr.fit(X=entreno[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=entreno["color"])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
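
The same list of twelve column names is typed out in every `fit` and `score` call in this section. One possible way to avoid the repetition, sketched here with a toy frame and the hypothetical names `TARGET` and `FEATURES`, is to derive the feature list once from the columns:

```python
import pandas as pd

# Toy frame standing in for `entreno` (illustrative values only).
train = pd.DataFrame({
    "pH": [3.2, 3.4, 3.1],
    "alcohol": [10.1, 9.5, 12.0],
    "color": [1, 0, 0],
})

TARGET = "color"  # hypothetical constant name
FEATURES = [c for c in train.columns if c != TARGET]  # every column except the label

X_train = train[FEATURES]
y_train = train[TARGET]
print(FEATURES)
```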

Now the model is evaluated on the test subset:

In [34]:
lr.score(X=testeo[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=testeo["color"])

0.99692307692307691

The last output means that the model classifies the wines as red or white with **99.69%** accuracy (1296 of the 1300 test wines).
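
For a classifier, `score` returns the accuracy, that is, the fraction of test samples whose predicted label matches the true one. The sketch below checks this on synthetic data generated with `make_classification`; the data and variable names are assumptions for illustration, not the wine dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for the wine data.
X, y = make_classification(n_samples=500, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# score() for a classifier is plain accuracy: the share of correct predictions.
accuracy_by_hand = np.mean(clf.predict(X_test) == y_test)
assert np.isclose(accuracy_by_hand, clf.score(X_test, y_test))

n_correct = int(np.sum(clf.predict(X_test) == y_test))
print(f"{n_correct} of {len(y_test)} test samples classified correctly")
```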

I create another logistic regression model from another instance of the `LogisticRegression` class, with parameters different from the defaults:

In [35]:
lr2=LogisticRegression(penalty="l2",solver="lbfgs",max_iter=600)

In [36]:
lr2.fit(X=entreno[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=entreno["color"])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=600, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='lbfgs', tol=0.0001,
          verbose=0, warm_start=False)

In [37]:
lr2.score(X=testeo[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=testeo["color"])

0.99692307692307691

Raising the maximum number of iterations from the default, 100, to 600 (and switching the solver from `liblinear` to `lbfgs`) does not improve the model's accuracy.
I will try fewer iterations.

In [38]:
lr3=LogisticRegression(penalty="l2",solver="lbfgs",max_iter=30)

In [39]:
lr3.fit(X=entreno[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=entreno["color"])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=30, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='lbfgs', tol=0.0001,
          verbose=0, warm_start=False)

In [40]:
lr3.score(X=testeo[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=testeo["color"])

0.99692307692307691

The result has not changed: with `lbfgs` the accuracy is the same with 30 iterations as with 600, so raising the iteration limit does not improve the model.
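
One way to see why the iteration limit may not matter is to look at the fitted attribute `n_iter_`, which records how many iterations the solver actually ran. If the solver converges before reaching `max_iter`, raising the limit cannot change the result. The sketch below uses synthetic data; the names `cap` and `iterations_used` are illustrative.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem standing in for the wine data.
X, y = make_classification(n_samples=500, n_features=12, random_state=0)

iterations_used = {}
for cap in (5, 30, 600):
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model = LogisticRegression(solver="lbfgs", max_iter=cap).fit(X, y)
    hit_cap = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    # n_iter_ holds the iterations actually run (a single entry for a binary problem).
    iterations_used[cap] = int(model.n_iter_[0])
    print(cap, iterations_used[cap], "stopped at cap" if hit_cap else "converged")
```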

## 2.2. Nearest Neighbors

In [41]:
from sklearn.neighbors import KNeighborsClassifier

In [42]:
knn=KNeighborsClassifier(n_neighbors=5,metric="euclidean",algorithm="brute")

In [43]:
knn.fit(X=entreno[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=entreno["color"])

KNeighborsClassifier(algorithm='brute', leaf_size=30, metric='euclidean',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [44]:
knn.score(X=testeo[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=testeo["color"])

0.99769230769230766

This Nearest Neighbors model has an accuracy of **99.77%**. It therefore improves slightly on the Logistic Regression model, although the difference amounts to a single wine out of the 1300 in the test set.
I will create another instance with different parameters:

In [45]:
knn2=KNeighborsClassifier(n_neighbors=10,metric="euclidean",algorithm="brute")

In [46]:
knn2.fit(X=entreno[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=entreno["color"])

KNeighborsClassifier(algorithm='brute', leaf_size=30, metric='euclidean',
           metric_params=None, n_jobs=1, n_neighbors=10, p=2,
           weights='uniform')

In [47]:
knn2.score(X=testeo[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=testeo["color"])

0.99769230769230766

No improvement over the previous one. Another example:

In [48]:
knn3=KNeighborsClassifier(n_neighbors=5,algorithm="ball_tree")

In [49]:
knn3.fit(X=entreno[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=entreno["color"])

KNeighborsClassifier(algorithm='ball_tree', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [50]:
knn3.score(X=testeo[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=testeo["color"])

0.99769230769230766

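Despite switching the *nearest neighbors* search algorithm from brute force to *ball_tree*, the accuracy has not changed. This is to be expected, since the `algorithm` parameter only changes how the neighbors are searched for, not which neighbors are found.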

In [51]:
knn4=KNeighborsClassifier(n_neighbors=10,algorithm="ball_tree")

In [52]:
knn4.fit(X=entreno[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=entreno["color"])

KNeighborsClassifier(algorithm='ball_tree', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=10, p=2,
           weights='uniform')

In [53]:
knn4.score(X=testeo[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=testeo["color"])

0.99769230769230766

In [54]:
knn5=KNeighborsClassifier(n_neighbors=5,algorithm="auto")

In [55]:
knn5.fit(X=entreno[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=entreno["color"])

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [56]:
knn5.score(X=testeo[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=testeo["color"])

0.99769230769230766

In [57]:
knn6=KNeighborsClassifier(n_neighbors=10,algorithm="auto")

In [58]:
knn6.fit(X=entreno[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=entreno["color"])

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=10, p=2,
           weights='uniform')

In [59]:
knn6.score(X=testeo[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=testeo["color"])

0.99769230769230766

In [60]:
knn7=KNeighborsClassifier(n_neighbors=2,algorithm="auto")

In [61]:
knn7.fit(X=entreno[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=entreno["color"])

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=2, p=2,
           weights='uniform')

In [62]:
knn7.score(X=testeo[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=testeo["color"])

0.99615384615384617

The classification accuracy does not change unless the number of neighbors used in the model is reduced considerably, as the previous example shows.
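
The manual experiments above can be condensed into a loop over the parameter values. The sketch below runs on synthetic data (an assumption, not the wine dataset) and collects the test accuracy for each combination of `n_neighbors` and `algorithm` in a dictionary named `scores`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary problem standing in for the wine data.
X, y = make_classification(n_samples=500, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scores = {}
for k in (2, 5, 10):
    for algorithm in ("brute", "ball_tree"):
        knn = KNeighborsClassifier(n_neighbors=k, algorithm=algorithm)
        scores[(k, algorithm)] = knn.fit(X_train, y_train).score(X_test, y_test)

for (k, algorithm), acc in scores.items():
    print(f"k={k:<2} algorithm={algorithm:<9} accuracy={acc:.4f}")
```

Comparing models on the test set like this is fine for informal exploration; for an actual model choice, cross-validation on the training data is the safer route.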

In [63]:
knn8=KNeighborsClassifier(n_neighbors=5,metric="minkowski",algorithm="auto",p=3)

In [64]:
knn8.fit(X=entreno[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=entreno["color"])

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=3,
           weights='uniform')

In [65]:
knn8.score(X=testeo[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=testeo["color"])

0.99692307692307691

With the Minkowski metric of order `p=3` the accuracy is slightly lower than with the Euclidean metric (which is the Minkowski metric with `p=2`).
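
To make the role of `p` concrete, here is a minimal sketch of the Minkowski distance between two vectors. The helper `minkowski` is written for illustration and is not scikit-learn's implementation.

```python
import numpy as np

def minkowski(u, v, p):
    """Minkowski distance of order p between two vectors."""
    return np.sum(np.abs(u - v) ** p) ** (1.0 / p)

u = np.array([0.0, 0.0])
v = np.array([3.0, 4.0])

print(minkowski(u, v, 1))  # p=1: Manhattan distance
print(minkowski(u, v, 2))  # p=2: Euclidean distance
print(minkowski(u, v, 3))  # p=3: the order used by knn8
```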

## 2.3. Support Vector Machines

In [66]:
from sklearn.svm import SVC

In [67]:
svm=SVC(kernel="rbf")

In [68]:
svm.fit(X=entreno[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=entreno["color"])

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [69]:
svm.score(X=testeo[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=testeo["color"])

0.99769230769230766

The *score* obtained is the same as with the Nearest Neighbors model.

## 2.4. Neural networks

In [70]:
from sklearn.neural_network import MLPClassifier

In [71]:
mlp=MLPClassifier(hidden_layer_sizes=(15,15),
                  activation="logistic",
                  solver="sgd",
                  batch_size=10,
                  alpha=0.0,
                  learning_rate="adaptive",
                  learning_rate_init=0.015,
                  momentum=0.8,
                  nesterovs_momentum=True,
                  max_iter=500) 

In [72]:
mlp.fit(X=entreno[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=entreno["color"])

MLPClassifier(activation='logistic', alpha=0.0, batch_size=10, beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(15, 15), learning_rate='adaptive',
       learning_rate_init=0.015, max_iter=500, momentum=0.8,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='sgd', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [73]:
mlp.score(X=testeo[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=testeo["color"])

0.99692307692307691

## 2.5. Decision trees

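Models based on decision trees do not require the variables to be standardized, and standardizing can even be counterproductive (for instance, the split thresholds are then no longer expressed in the original units). The dataset therefore has to be handled differently from how it was for the previous classifiers. Note that a new random split is drawn below, so the test set is not the same as the one used for the previous models.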

In [78]:
vino=concat([crudo,color],axis=1)

In [79]:
vino.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,color
0,5.2,0.34,0.0,1.8,0.05,27.0,63.0,0.9916,3.68,0.79,14.0,6,1
1,6.2,0.55,0.45,12.0,0.049,27.0,186.0,0.9974,3.17,0.5,9.3,6,0
2,7.15,0.17,0.24,9.6,0.119,56.0,178.0,0.99578,3.15,0.44,10.2,6,0
3,6.7,0.64,0.23,2.1,0.08,11.0,119.0,0.99538,3.36,0.7,10.9,5,1
4,7.6,0.23,0.34,1.6,0.043,24.0,129.0,0.99305,3.12,0.7,10.4,5,0


In [80]:
vinosep=train_test_split(vino,train_size=0.8, test_size=0.2)
entrenos=vinosep[0]
testeos=vinosep[1]
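
The claim that decision trees do not need standardized inputs can be checked empirically. The sketch below, on synthetic data and with illustrative names, fits two trees with identical settings and seed, one on raw features and one on standardized features, and measures how often their test predictions agree. Since standardization preserves the ordering of values within each feature, the two trees are expected to make essentially the same splits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem standing in for the wine data.
X, y = make_classification(n_samples=500, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_train)

# Same tree settings and seed; only the feature scaling differs.
tree_raw = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_train, y_train)
tree_std = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(
    scaler.transform(X_train), y_train
)

agreement = np.mean(
    tree_raw.predict(X_test) == tree_std.predict(scaler.transform(X_test))
)
print(f"share of identical test predictions: {agreement:.3f}")
```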

In [81]:
from sklearn.tree import DecisionTreeClassifier

In [82]:
arbol1=DecisionTreeClassifier(criterion="entropy")

In [83]:
arbol1.fit(X=entrenos[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=entrenos["color"])

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [84]:
arbol1.score(X=testeos[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=testeos["color"])

0.98615384615384616

In [85]:
arbol2=DecisionTreeClassifier(criterion="entropy",max_depth=3)

In [86]:
arbol2.fit(X=entrenos[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=entrenos["color"])

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [87]:
arbol2.score(X=testeos[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=testeos["color"])

0.97230769230769232

In [88]:
arbol3=DecisionTreeClassifier(criterion="gini")

In [89]:
arbol3.fit(X=entrenos[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=entrenos["color"])

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [90]:
arbol3.score(X=testeos[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=testeos["color"])

0.9869230769230769

In [91]:
arbol1.fit(X=entreno[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=entreno["color"])

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [92]:
arbol1.score(X=testeo[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=testeo["color"])

0.99230769230769234

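The decision tree models scored lower than earlier classifiers such as the MLP (0.997): on the unstandardized split they reached 0.986 (entropy), 0.972 (entropy with `max_depth=3`) and 0.987 (gini), and the entropy tree refitted on the standardized split reached 0.992.
Curiously, the standardized data gave the better result. The two datasets were split separately, and the split above sets no `random_state`, so part of the gap may simply reflect the different test samples.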
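The trees above were fitted with both `criterion="entropy"` and `criterion="gini"`. As a rough illustration of what those two impurity measures compute, the following sketch evaluates them for a binary node as a function of the proportion `p` of one class. The function names and the chosen values of `p` are assumptions made for the example.

```python
import math

def gini(p):
    """Gini impurity of a binary node whose positive-class proportion is p."""
    return 1.0 - (p ** 2 + (1.0 - p) ** 2)

def entropy(p):
    """Shannon entropy (in bits) of a binary node with positive-class proportion p."""
    if p in (0.0, 1.0):
        return 0.0  # a pure node has no uncertainty
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

for p in (0.0, 0.1, 0.25, 0.5):
    print(f"p={p:.2f}  gini={gini(p):.3f}  entropy={entropy(p):.3f}")
```

Both measures are zero for a pure node and peak at `p = 0.5`, which is consistent with the two criteria giving very similar accuracies here.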

## 2.6. Naïve Bayes

In [93]:
from sklearn.naive_bayes import GaussianNB

In [94]:
nb=GaussianNB()

In [95]:
nb.fit(X=entrenos[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=entrenos["color"])

GaussianNB(priors=None)

In [96]:
nb.score(X=testeos[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=testeos["color"])

0.97153846153846157

In [97]:
nb.fit(X=entreno[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=entreno["color"])

GaussianNB(priors=None)

In [98]:
nb.score(X=testeo[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=testeo["color"])

0.98538461538461541

The Naïve Bayes models also scored lower than the earlier classifiers: 0.972 on the unstandardized split and 0.985 on the standardized one. The standardized data again gave the better result, although the two splits contain different samples, so the comparison is only indicative.
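To make the idea behind Gaussian Naïve Bayes concrete, here is a minimal from-scratch sketch: each class gets a prior plus a per-feature mean and variance, and a new sample is assigned to the class with the highest log-posterior under the feature-independence assumption. The data rows, the helper names `fit_gnb` and `predict_gnb`, and the `1e-9` variance floor are assumptions for illustration, not the scikit-learn implementation.

```python
import math
from statistics import mean, pvariance

def fit_gnb(X, y):
    """Estimate per-class priors, feature means and feature variances."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        cols = list(zip(*rows))  # transpose rows into feature columns
        model[c] = {
            "prior": len(rows) / len(X),
            "mean": [mean(col) for col in cols],
            "var": [pvariance(col) + 1e-9 for col in cols],  # small floor avoids division by zero
        }
    return model

def predict_gnb(model, x):
    """Pick the class with the highest log-posterior, treating features as independent."""
    def log_post(params):
        total = math.log(params["prior"])
        for v, m, s2 in zip(x, params["mean"], params["var"]):
            total += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return total
    return max(model, key=lambda c: log_post(model[c]))

# Made-up rows: [volatile acidity, total sulfur dioxide]; 0 = white, 1 = red
X = [[0.23, 129.0], [0.17, 178.0], [0.28, 150.0],
     [0.64, 63.0], [0.55, 40.0], [0.70, 35.0]]
y = [0, 0, 0, 1, 1, 1]

model = fit_gnb(X, y)
print(predict_gnb(model, [0.20, 160.0]))
print(predict_gnb(model, [0.60, 50.0]))
```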

## 2.7. Ensembles
### 2.7.1. Bagging

In [74]:
from sklearn.ensemble import BaggingClassifier

In [75]:
baglr=BaggingClassifier(lr,
                        n_estimators=50,
                        max_samples=0.3,
                        bootstrap=True,
                        bootstrap_features=True,
                        n_jobs=4)

In [76]:
baglr.fit(X=entreno[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=entreno["color"])

BaggingClassifier(base_estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
         bootstrap=True, bootstrap_features=True, max_features=1.0,
         max_samples=0.3, n_estimators=50, n_jobs=4, oob_score=False,
         random_state=None, verbose=0, warm_start=False)

In [77]:
baglr.score(X=testeo[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=testeo["color"])

0.99538461538461542

As a further test, a new logistic regression instance with L1 regularization is defined and bagged in the same way:

In [99]:
lr4 = LogisticRegression(penalty="l1",solver="liblinear")

In [100]:
baglr4=BaggingClassifier(lr4,
                        n_estimators=50,
                        max_samples=0.3,
                        bootstrap=True,
                        bootstrap_features=True,
                        n_jobs=4)

In [101]:
baglr4.fit(X=entreno[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=entreno["color"])

BaggingClassifier(base_estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
         bootstrap=True, bootstrap_features=True, max_features=1.0,
         max_samples=0.3, n_estimators=50, n_jobs=4, oob_score=False,
         random_state=None, verbose=0, warm_start=False)

In [102]:
baglr4.score(X=testeo[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=testeo["color"])

0.99538461538461542
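The bagging procedure used above can be sketched from scratch: each base learner is fitted on a bootstrap sample (drawn with replacement) of a fraction of the training data, and the ensemble combines their predictions. In this sketch the base learner is a toy one-feature threshold rule rather than a logistic regression, predictions are combined by majority vote (scikit-learn's `BaggingClassifier` averages predicted probabilities when the base estimator provides them), and feature bootstrapping is left out. All names and data are illustrative assumptions.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Toy base learner: threshold halfway between the two class means of one feature."""
    by_class = {0: [], 1: []}
    for x, label in sample:
        by_class[label].append(x)
    if not by_class[0] or not by_class[1]:
        # Degenerate bootstrap sample containing a single class: always predict it
        only = 0 if by_class[0] else 1
        return lambda x: only
    m0 = sum(by_class[0]) / len(by_class[0])
    m1 = sum(by_class[1]) / len(by_class[1])
    thr = (m0 + m1) / 2
    high = 0 if m0 > m1 else 1  # class whose mean lies above the threshold
    return lambda x: high if x > thr else 1 - high

def fit_bagging(data, n_estimators=50, max_samples=0.3, seed=0):
    """Fit each base learner on its own bootstrap sample of size max_samples * len(data)."""
    rng = random.Random(seed)
    k = max(1, int(max_samples * len(data)))
    return [fit_stump([rng.choice(data) for _ in range(k)]) for _ in range(n_estimators)]

def predict_bagging(estimators, x):
    """Majority vote over the base learners."""
    votes = Counter(est(x) for est in estimators)
    return votes.most_common(1)[0][0]

# Made-up (total sulfur dioxide, color) pairs; 0 = white, 1 = red
data = [(186.0, 0), (178.0, 0), (129.0, 0), (150.0, 0), (140.0, 0),
        (63.0, 1), (40.0, 1), (35.0, 1), (119.0, 1), (55.0, 1)]

ensemble = fit_bagging(data)
print(predict_bagging(ensemble, 170.0), predict_bagging(ensemble, 45.0))
```

Individual learners trained on three samples are weak, and some see only one class, but the vote across 50 of them is stable.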

### 2.7.2. Random Forests

In [103]:
from sklearn.ensemble import RandomForestClassifier

In [104]:
rf = RandomForestClassifier(n_estimators=20, # 20 trees
                            criterion="entropy", 
                            max_depth=3, 
                            min_samples_split=10,
                            min_samples_leaf=5,
                            bootstrap=True)

In [105]:
rf.fit(X=entrenos[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=entrenos["color"])

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=3, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=5,
            min_samples_split=10, min_weight_fraction_leaf=0.0,
            n_estimators=20, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [106]:
rf.score(X=testeos[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=testeos["color"])

0.98076923076923073

## 2.8. Validation

Up to this point the models have been validated very informally. This section validates them more rigorously with cross-validation, using scikit-learn's `GridSearchCV` class, which trains and cross-validates a model over a grid of hyperparameters automatically.
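As a reminder of what k-fold cross-validation does, the following sketch splits the sample indices into `k` contiguous folds, fits on `k - 1` of them and scores on the held-out fold. The helper names, the toy majority-class "model" and the labels are assumptions for illustration. Note that for classifiers scikit-learn uses stratified folds when `cv` is an integer, which this simplified version does not reproduce.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, valid_idx) pairs for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, valid
        start += size

def cross_val_scores(fit, score, X, y, k=5):
    """Fit on k-1 folds, score on the held-out fold, and return the k scores."""
    scores = []
    for train, valid in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(score(model, [X[i] for i in valid], [y[i] for i in valid]))
    return scores

# Toy "model": always predict the majority class seen during training
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)

def accuracy(model, X_valid, y_valid):
    return sum(1 for label in y_valid if label == model) / len(y_valid)

X = list(range(12))  # placeholder features (unused by the toy model)
y = [0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0]

scores = cross_val_scores(fit_majority, accuracy, X, y, k=4)
print(scores, sum(scores) / len(scores))
```

The mean of the per-fold scores is the quantity that `GridSearchCV` reports as `best_score_` for its best candidate.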

In [107]:
from sklearn.model_selection import GridSearchCV

A grid search is now built for each model.

In [108]:
lr_gs = GridSearchCV(estimator=LogisticRegression(solver="liblinear"), # liblinear supports both L1 and L2 penalties
                          param_grid={"penalty": ["l1","l2"], # Try L1 and L2 regularization
                                      "C": [0.5, 1.0, 1.5], # Several values of C (inverse regularization strength)
                                     "max_iter":[20,50,100],}, 
                            scoring="roc_auc", # Use the area under the ROC curve instead of accuracy
                            cv=5,
                            verbose=1)

knn_gs=GridSearchCV(estimator=KNeighborsClassifier(),
                   param_grid={"n_neighbors":[2,5,10]},
                    scoring="roc_auc",
                    cv=5,
                    verbose=1)
                              
svm_gs=GridSearchCV(estimator=SVC(),
                   param_grid={"C": [0.5, 1.0, 1.5],
                                  "kernel": ["linear","poly","rbf"],
                                  "degree": [2,3,4]},
                      scoring="roc_auc",
                      cv=5,
                      verbose=1)

lp_gs=GridSearchCV(estimator=MLPClassifier(),
                   param_grid={"activation":["logistic","tanh"],
                              "solver":["sgd","lbfgs"]},
                   scoring="roc_auc",
                   cv=5,
                   verbose=1)

arbol_gs=GridSearchCV(estimator=DecisionTreeClassifier(),
                        param_grid={"criterion":["gini","entropy"],
                                   "max_depth": [2,4,6],
                                   "min_samples_split": [2,5,10],
                                   "min_samples_leaf": [2,5,10]},
                     scoring="roc_auc",
                     cv=5,
                     verbose=1)

rf_gs=GridSearchCV(estimator=RandomForestClassifier(),
                  param_grid={"n_estimators": [5,20,50],
                                        "max_depth": [2,4,6],
                                       "min_samples_split": [3,5,10],
                                       "min_samples_leaf": [3,5,10]},
                  scoring="roc_auc",
                  cv=5,
                  verbose=1)

bag_modelos=[lr_gs,knn_gs,svm_gs,lp_gs,arbol_gs,rf_gs]

All the grid searches are fitted; each one cross-validates its candidates and keeps the best model of its type:

In [109]:
for modelo in bag_modelos:
    modelo.fit(X=entreno[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=entreno["color"])

Fitting 5 folds for each of 18 candidates, totalling 90 fits


[Parallel(n_jobs=1)]: Done  90 out of  90 | elapsed:   14.0s finished


Fitting 5 folds for each of 3 candidates, totalling 15 fits


[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:   25.9s finished


Fitting 5 folds for each of 27 candidates, totalling 135 fits


[Parallel(n_jobs=1)]: Done 135 out of 135 | elapsed:   56.9s finished


Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:  6.9min finished


Fitting 5 folds for each of 54 candidates, totalling 270 fits


[Parallel(n_jobs=1)]: Done 270 out of 270 | elapsed:   20.4s finished


Fitting 5 folds for each of 81 candidates, totalling 405 fits


[Parallel(n_jobs=1)]: Done 405 out of 405 | elapsed:  3.0min finished
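The candidate counts in the log above follow directly from the grids: each search evaluates the Cartesian product of its parameter lists, once per fold. The sketch below recomputes them from copies of the grids defined earlier; the dictionary `grids` and the variable names are introduced only for this check.

```python
from itertools import product

# Copies of the param_grid dictionaries used in the grid searches above
grids = {
    "lr_gs": {"penalty": ["l1", "l2"], "C": [0.5, 1.0, 1.5], "max_iter": [20, 50, 100]},
    "knn_gs": {"n_neighbors": [2, 5, 10]},
    "svm_gs": {"C": [0.5, 1.0, 1.5], "kernel": ["linear", "poly", "rbf"], "degree": [2, 3, 4]},
    "lp_gs": {"activation": ["logistic", "tanh"], "solver": ["sgd", "lbfgs"]},
    "arbol_gs": {"criterion": ["gini", "entropy"], "max_depth": [2, 4, 6],
                 "min_samples_split": [2, 5, 10], "min_samples_leaf": [2, 5, 10]},
    "rf_gs": {"n_estimators": [5, 20, 50], "max_depth": [2, 4, 6],
              "min_samples_split": [3, 5, 10], "min_samples_leaf": [3, 5, 10]},
}
cv = 5

# Number of hyperparameter combinations per search
counts = {name: len(list(product(*grid.values()))) for name, grid in grids.items()}
for name, n in counts.items():
    print(f"{name}: {n} candidates x {cv} folds = {n * cv} fits")
```

Since `degree` only affects the `poly` kernel, several of the 27 SVC candidates are effectively duplicates, which is one reason that search is slower than it needs to be.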


In [110]:
lista_scores=[]
for modelo in bag_modelos:
    lista_scores.append(modelo.best_score_)

In [111]:
lista_scores

[0.99583089309258266,
 0.9966147718945132,
 0.99685574486046458,
 0.99798238382992799,
 0.98879966649654039,
 0.99832994017397725]
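The scores above are ROC AUC values rather than accuracies. One way to read the metric is as the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The sketch below computes it that way on made-up labels and scores; the function name and the data are assumptions, and the quadratic pairwise loop is meant for illustration, not efficiency.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, label in zip(scores, y_true) if label == 1]
    neg = [s for s, label in zip(scores, y_true) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and classifier scores (illustrative only)
y_true = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.35, 0.4, 0.3, 0.2, 0.1]

print(roc_auc(y_true, scores))
```

Here 11 of the 12 pairs are ordered correctly, so the AUC is 11/12. Unlike accuracy, the value does not depend on a particular decision threshold.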

The best-performing model is the last one, the random forest, with a mean cross-validated ROC AUC of 0.9983. The hyperparameters of the best estimator in this category are:

In [112]:
bag_modelos[5].best_estimator_

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=6, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=3,
            min_samples_split=3, min_weight_fraction_leaf=0.0,
            n_estimators=50, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

Once the best model has been identified, it is trained conventionally (without GridSearchCV) using the best hyperparameters found:

In [113]:
modelo_ganador=RandomForestClassifier(n_estimators=50, # tuned by the grid search
                                      max_depth=6, # tuned by the grid search
                                      min_samples_split=3, # tuned by the grid search
                                      min_samples_leaf=3) # tuned by the grid search
# All other parameters keep their defaults. min_impurity_split and
# max_features='auto' are no longer accepted by recent scikit-learn releases.

In [114]:
modelo_ganador.fit(X=entreno[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=entreno["color"])

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=6, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=3,
            min_samples_split=3, min_weight_fraction_leaf=0.0,
            n_estimators=50, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

The final evaluation uses the test subset set aside in the first section, which played no part in the cross-validated model selection:

In [115]:
modelo_ganador.score(X=testeo[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']],y=testeo["color"])

0.98999999999999999

# 3. Regression

In [152]:
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

In [148]:
ridge_gs = GridSearchCV(estimator=Ridge(),
                        param_grid={"alpha":[0.1,1,10,100]}, # Regularization hyperparameter: sets the strength of the shrinkage
                        verbose=1)

Lasso_gs = GridSearchCV(estimator=Lasso(),
                        param_grid={"alpha":[0.1,1,10,100]},
                        verbose=1)

knn_rg_gs = GridSearchCV(estimator=KNeighborsRegressor(),
                      param_grid={"n_neighbors":[2,5,10]},
                    verbose=1) 

rf_rg_gs =GridSearchCV(estimator=RandomForestRegressor(),
                     param_grid={"n_estimators": [5,20,50],
                                        "max_depth": [2,4,6],
                                       "min_samples_split": [3,5,10],
                                       "min_samples_leaf": [3,5,10]},
                  verbose=1)
              
svr_gs = GridSearchCV(estimator=SVR(),
                     param_grid={"C": [0.5, 1.0, 1.5],
                                  "kernel": ["linear","poly","rbf"],
                                  "degree": [2,3,4]},
                      verbose=1)

arbol_rg_gs = GridSearchCV(estimator=DecisionTreeRegressor(),
                                param_grid={"max_depth": [2,4,6],
                                   "min_samples_split": [2,5,10],
                                   "min_samples_leaf": [2,5,10]},
                    verbose=1)         
                             
bag_rg_modelos=[ridge_gs,Lasso_gs,knn_rg_gs,rf_rg_gs,svr_gs,arbol_rg_gs]
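The effect of the `alpha` values searched for `Ridge` can be seen in the simplest possible case. For a single centred feature and no intercept, minimising the squared error plus `alpha` times the squared coefficient has the closed-form solution `sum(x*y) / (sum(x*x) + alpha)`. The sketch below evaluates it on made-up data; the function name and the data are assumptions, and `alpha = 0` is included only as the unregularized reference.

```python
def ridge_slope(x, y, alpha):
    """Closed-form ridge coefficient for one centred feature and no intercept."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + alpha)

# Made-up, already centred data (illustrative only)
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-1.9, -1.2, 0.1, 0.8, 2.2]

alphas = [0.0, 0.1, 1, 10, 100]
slopes = [ridge_slope(x, y, alpha) for alpha in alphas]
for alpha, slope in zip(alphas, slopes):
    print(f"alpha={alpha}: slope={slope:.4f}")
```

Larger values of `alpha` pull the coefficient towards zero, which is the shrinkage mentioned in the comment on the `Ridge` grid.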

In [149]:
for model in bag_rg_modelos:
    model.fit(X=entreno[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'color']],y=entreno["quality"])

Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed:    0.5s finished


Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed:    0.8s finished


Fitting 3 folds for each of 3 candidates, totalling 9 fits


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:   12.8s finished


Fitting 3 folds for each of 81 candidates, totalling 243 fits


[Parallel(n_jobs=1)]: Done 243 out of 243 | elapsed:  2.7min finished


Fitting 3 folds for each of 27 candidates, totalling 81 fits


[Parallel(n_jobs=1)]: Done  81 out of  81 | elapsed:  9.1min finished


Fitting 3 folds for each of 27 candidates, totalling 81 fits


[Parallel(n_jobs=1)]: Done  81 out of  81 | elapsed:    5.2s finished


In [150]:
lista_rg_scores=[]
for model in bag_rg_modelos:
    lista_rg_scores.append(model.best_score_)

In [151]:
lista_rg_scores

[0.29062563797793906,
 0.23744677080708773,
 0.34689816375458377,
 0.36543088473252189,
 0.39723683129587639,
 0.28470365870543274]

The best result belongs to the fifth model, SVR, with a mean cross-validated R² of 0.3972.
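No `scoring` argument was passed to the regression grid searches, so they fall back to each estimator's own `score` method, which for scikit-learn regressors is the coefficient of determination R². A minimal sketch of that quantity, on made-up quality ratings and predictions, is given below; the function name `r_squared` and the numbers are assumptions for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1.0 - ss_res / ss_tot

# Made-up quality ratings and predictions (illustrative only)
quality = [6, 6, 6, 5, 5, 7, 4, 6]
predicted = [5.8, 6.1, 5.9, 5.4, 5.6, 6.3, 5.2, 5.9]

print(r_squared(quality, predicted))

# Reference points: always predicting the mean gives 0, perfect predictions give 1
mean_quality = sum(quality) / len(quality)
print(r_squared(quality, [mean_quality] * len(quality)))
print(r_squared(quality, quality))
```

An R² of about 0.40 therefore means the model explains roughly 40% of the variance in `quality` relative to always predicting the mean.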

In [158]:
bag_rg_modelos[4].best_estimator_

SVR(C=1.5, cache_size=200, coef0=0.0, degree=2, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [159]:
modelo_ganador_rg=SVR(C=1.5, cache_size=200, coef0=0.0, degree=2, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [160]:
modelo_ganador_rg.fit(X=entreno[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'color']],y=entreno["quality"])

SVR(C=1.5, cache_size=200, coef0=0.0, degree=2, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [162]:
modelo_ganador_rg.score(X=testeo[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'color']],y=testeo["quality"])

0.38086443624290056