# Validation exercise

For this example, import the data from the `student-mat.csv` dataset and validate a model using the following schemes:

- Hold-out validation
- Cross-validation

>Note: Use [sklearn.linear_model.LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) for the predictions.
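
Before working with the real data, the difference between the two schemes can be sketched on synthetic data. This is only an illustration: the data from `make_regression` is an assumption that stands in for `student-mat.csv`, and the variable names are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic regression data (an assumption; it stands in for student-mat.csv)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Hold-out: one single train/validation split, one single score
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
holdout_r2 = LinearRegression().fit(X_tr, y_tr).score(X_val, y_val)

# Cross-validation: each sample is used for validation exactly once, one score per fold
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_r2 = cross_val_score(LinearRegression(), X, y, cv=cv)

print("Hold-out R^2:", round(holdout_r2, 4))
print("Cross-validation R^2 per fold:", cv_r2.round(4))
```

Hold-out yields a single estimate that depends on one particular split, while cross-validation yields one estimate per fold, so its spread can also be reported.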

In [6]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing
from sklearn.model_selection import KFold
import warnings
warnings.filterwarnings('ignore')

## Import data

In [2]:
df = pd.read_csv('student-mat.csv', sep=';') #TODO
df

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
390,MS,M,20,U,LE3,A,2,2,services,services,...,5,5,4,4,5,4,11,9,9,9
391,MS,M,17,U,LE3,T,3,1,services,services,...,2,4,5,3,4,2,3,14,16,16
392,MS,M,21,R,GT3,T,1,1,other,other,...,5,5,3,3,3,3,3,10,8,7
393,MS,M,18,R,LE3,T,3,2,services,other,...,4,4,1,3,4,5,0,11,12,10


In [3]:
df = df.select_dtypes(include=['number']) #TODO - Use numeric columns only
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   age         395 non-null    int64
 1   Medu        395 non-null    int64
 2   Fedu        395 non-null    int64
 3   traveltime  395 non-null    int64
 4   studytime   395 non-null    int64
 5   failures    395 non-null    int64
 6   famrel      395 non-null    int64
 7   freetime    395 non-null    int64
 8   goout       395 non-null    int64
 9   Dalc        395 non-null    int64
 10  Walc        395 non-null    int64
 11  health      395 non-null    int64
 12  absences    395 non-null    int64
 13  G1          395 non-null    int64
 14  G2          395 non-null    int64
 15  G3          395 non-null    int64
dtypes: int64(16)
memory usage: 49.5 KB
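
As a small illustration of what `select_dtypes(include=['number'])` does, the sketch below applies it to a toy frame. The values are hypothetical and only mimic the mix of text and integer columns seen above.

```python
import pandas as pd

# Toy frame with hypothetical values, mixing a text column and numeric columns
toy = pd.DataFrame({
    "school": ["GP", "MS"],
    "age": [18, 17],
    "G3": [6, 16],
})

# Only the numeric columns survive; the text column "school" is dropped
numeric = toy.select_dtypes(include=["number"])
print(list(numeric.columns))
```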


## Hold-out validation

In [4]:
# Test: 80-20% hold-out split. # Outer partition
X_training, X_test, y_training, y_test = train_test_split(df.iloc[:,:-1], df['G3'], test_size=0.20, random_state=42) #TODO

# G3 is a numeric grade; the counts show how its values are distributed in the test set
valores_test, ocur_test = np.unique(y_test, return_counts=True)

print('Test: ', 'values:', valores_test)
print('\n')
print('counts: ', ocur_test)

Test:  values: [ 0  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]


counts:  [ 5  4  6  1  6  5 11  5  5  5  6 10  4  3  1  2]


In [7]:
# Standardize the training and test features (the scaler is fitted on the training data only)
standardizer = preprocessing.StandardScaler() #TODO
X_training = standardizer.fit_transform(X_training) #TODO
X_test = standardizer.transform(X_test) #TODO

print(X_training)
print(X_test)

[[-0.58639605  0.24643712  0.42320737 ... -0.46440769  0.33205033
   0.62616324]
 [-0.58639605 -0.68063585  0.42320737 ... -0.70225668  0.64340909
   0.89283114]
 [-0.58639605 -1.60770883  0.42320737 ... -0.70225668 -0.91338472
  -0.97384417]
 ...
 [ 1.77915057  0.24643712  0.42320737 ...  1.08161077 -0.60202596
  -0.44050837]
 [ 0.20211949  1.17351009  0.42320737 ... -0.70225668  0.64340909
   1.15949904]
 [-1.37491159  1.17351009  1.35191243 ... -0.22655869 -0.2906672
   0.62616324]]
[[ 0.20211949 -0.68063585 -1.43420275 ... -0.46440769 -0.91338472
  -0.70717627]
 [ 0.99063503 -1.60770883 -0.50549769 ... -0.34548319  0.95476785
   0.35949533]
 [ 0.99063503  0.24643712  0.42320737 ...  0.24913929 -2.47017853
  -1.50717997]
 ...
 [-1.37491159 -0.68063585 -1.43420275 ...  0.24913929 -0.60202596
  -0.44050837]
 [-1.37491159  1.17351009 -0.50549769 ... -0.46440769  1.26612661
   0.89283114]
 [ 0.20211949  0.24643712 -0.50549769 ...  1.20053527 -1.53610225
  -1.50717997]]
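
The standardization step can be sketched in isolation as follows. The random arrays are an assumption that stands in for `X_training` and `X_test`; the point is that `transform` reuses the mean and scale learned in `fit`, so the test data never influences those statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_fit = rng.normal(loc=10.0, scale=3.0, size=(100, 2))  # plays the role of X_training
X_new = rng.normal(loc=10.0, scale=3.0, size=(20, 2))   # plays the role of X_test

# Learn mean and standard deviation from the training data only
scaler = StandardScaler().fit(X_fit)
X_fit_std = scaler.transform(X_fit)
X_new_std = scaler.transform(X_new)

# transform() is (x - mean_) / scale_ with the statistics learned in fit()
manual = (X_new - scaler.mean_) / scaler.scale_

print(np.allclose(X_new_std, manual))
print(X_fit_std.mean(axis=0).round(6), X_fit_std.std(axis=0).round(6))
```

After the transformation, the training columns have zero mean and unit standard deviation. The transformed test columns generally do not, because they are scaled with the training statistics.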


In [8]:
# Validation: 80-20% hold-out split. # Inner partition
X_train, X_val, y_train, y_val = train_test_split(X_training, y_training, test_size = 0.2, random_state = 42) #TODO

valores_train, ocur_train = np.unique(y_train, return_counts=True)
print('Training: ', ' values:', valores_train)
print('\n')
print('counts:', ocur_train)
print('\n')


valores_val, ocur_val = np.unique(y_val, return_counts=True)

print('Validation: ', 'values:', valores_val)
print('\n')
print('counts: ', ocur_val)

Training:   values: [ 0  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20]


counts: [27  1  3  6  5 17 16 34 30 22 23 18 21 12  3 11  2  1]


Validation:  values: [ 0  6  7  8  9 10 11 12 13 14 15 19]


counts:  [ 6  3  3  9  7 11 12  4  3  3  2  1]
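
The sizes of the three partitions produced by the two nested 80-20% splits can be checked with a sketch that splits row indices instead of the real data. The 395 rows come from the `df.info()` output above; everything else is illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

indices = np.arange(395)  # one index per row of the dataset

# Outer split: 80% training (train + validation), 20% test
outer_train, test = train_test_split(indices, test_size=0.20, random_state=42)

# Inner split: 80% of the outer training part for fitting, 20% for validation
train, val = train_test_split(outer_train, test_size=0.20, random_state=42)

print(len(train), len(val), len(test))
```

The resulting sizes are 252, 64 and 79, which match the sums of the counts printed above and amount to roughly 64%, 16% and 20% of the data.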


In [10]:
# Build the object that holds the learning algorithm.
# We use a linear regression algorithm
lm = LinearRegression() #TODO

In [11]:
lm.fit(X_train, y_train) # TODO - Train the model

# For a regressor, score() returns the coefficient of determination R^2, not an accuracy
val_accuracy = lm.score(X_val, y_val) # TODO - Evaluate the model on validation
print('R^2 on validation ', np.round(val_accuracy*100, 2), '%')

test_accuracy = lm.score(X_test, y_test) # TODO - Evaluate the model on test
print('R^2 on test ', np.round(test_accuracy*100, 2), '%')

R^2 on validation  83.56 %
R^2 on test  77.68 %
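
Note that `LinearRegression.score` returns the coefficient of determination R², not a classification accuracy, so the percentages above are R² values multiplied by 100. The sketch below, on synthetic data (an assumption), computes R² by hand and compares it with `score` and with `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data, used only to illustrate the metric
X, y = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=0)
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
manual_r2 = 1.0 - ss_res / ss_tot

print(np.isclose(manual_r2, model.score(X, y)))
print(np.isclose(manual_r2, r2_score(y, y_pred)))
```

An R² of 1 means perfect predictions, 0 means the model does no better than always predicting the mean of the target, and negative values are possible on unseen data.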


## Cross-validation

In [12]:
# Test: 80-20% split for cross-validation. OUTER PARTITION
X_training, X_test, y_training, y_test = train_test_split(df.iloc[:,:-1], df['G3'], test_size=0.20, random_state=42) #TODO

valores_test, ocur_test = np.unique(y_test, return_counts=True)
print('Test: ', 'values:', valores_test, ' counts: ', ocur_test)

# Note: y_train is still the inner training split left over from the hold-out section
# (252 samples); pass y_training instead to count the full outer training partition.
valores_train, ocur_train = np.unique(y_train, return_counts=True)
print('Training: ', ' values:', valores_train, '  counts:', ocur_train)

Test:  values: [ 0  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]  counts:  [ 5  4  6  1  6  5 11  5  5  5  6 10  4  3  1  2]
Training:   values: [ 0  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20]   counts: [27  1  3  6  5 17 16 34 30 22 23 18 21 12  3 11  2  1]


In [14]:
# Validation: 80-20% split for cross-validation. # Inner partition
X_train, X_val, y_train, y_val = train_test_split(X_training, y_training, test_size = 0.2, random_state = 42) #TODO

valores_train, ocur_train = np.unique(y_train, return_counts=True)
print('Training: ', ' values:', valores_train)
print('\n')
print('counts:', ocur_train)
print('\n')


valores_val, ocur_val = np.unique(y_val, return_counts=True)

print('Validation: ', 'values:', valores_val)
print('\n')
print('counts: ', ocur_val)

Training:   values: [ 0  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20]


counts: [27  1  3  6  5 17 16 34 30 22 23 18 21 12  3 11  2  1]


Validation:  values: [ 0  6  7  8  9 10 11 12 13 14 15 19]


counts:  [ 6  3  3  9  7 11 12  4  3  3  2  1]


In [15]:
# Build the object that holds the learning algorithm.
# We use a linear regression algorithm
lm = LinearRegression() #TODO

In [17]:
# Validation, training and evaluation of the learning algorithm.
# With "cv = KFold(n_splits=5)" we are running an INNER cross-validation!

"""
The cross-validation is run in results
"""

#TODO
# With no scoring argument, cross_val_score uses the estimator's score(), i.e. R^2 for a regressor
results = cross_val_score(lm, 
                          X_train, 
                          y_train, 
                          cv = KFold(n_splits=5, shuffle = True, random_state = 42))

print("Results per fold: ", results)
print("R^2 (mean +/- std.): %0.4f +/- %0.4f" % (results.mean(), results.std()))

Results per fold:  [0.8241974  0.81931996 0.778666   0.7762702  0.86397863]
R^2 (mean +/- std.): 0.8125 +/- 0.0325
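
To make explicit what `cross_val_score` does with a `KFold` splitter, the sketch below writes the loop by hand on synthetic data (an assumption) and checks that both give the same per-fold scores. It is a simplified view: the library function also handles scoring options and parallelism that are omitted here.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, used only to illustrate the procedure
X, y = make_regression(n_samples=150, n_features=4, noise=8.0, random_state=1)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
model = LinearRegression()

manual_scores = []
for train_idx, val_idx in cv.split(X):
    # Fit a fresh copy of the model on 4 folds, score it on the remaining fold
    fold_model = clone(model).fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(model, X, y, cv=cv)
print(np.allclose(manual_scores, library_scores))
```

Because the splitter has a fixed integer `random_state`, both passes over `cv.split` see identical folds, which is why the two sets of scores agree.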


In [25]:
# Once the model has been trained and validated to select the best hyperparameters,
# we use all the "train" and "val" data to train the final model

lm = lm.fit(X_training, y_training) # TODO - Training
test_results = lm.score(X_test, y_test) # TODO - Evaluation on test
print('R^2 on test: ', test_results*100, '%')
print('\n')
# We can also extract the predictions instead of only the R^2 score
y_pred = lm.predict(X_test) #TODO
y=pd.Series(y_pred, index = y_test.index)
print('Predictions:\n', y)
print('\n')
print('True labels:\n', y_test)

R^2 on test:  78.0358021376833 %


Predictions:
 78      5.971772
371    12.063020
248     3.155269
55      9.135405
390     7.689242
         ...    
364    10.190736
82      6.088894
114     9.406754
3      14.233254
18      4.204178
Length: 79, dtype: float64


True labels:
 78     10
371    12
248     5
55     10
390     9
       ..
364    12
82      6
114     9
3      15
18      5
Name: G3, Length: 79, dtype: int64
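
Since the predictions are continuous grades, error metrics expressed in grade points can complement R². The sketch below computes the mean absolute error and the root mean squared error on the first five rows shown above only; it illustrates the calculation and is not the error over the full test set of 79 rows.

```python
import numpy as np

# First five predictions and true labels copied from the output above
y_hat = np.array([5.971772, 12.063020, 3.155269, 9.135405, 7.689242])
y_true = np.array([10, 12, 5, 10, 9])

errors = y_true - y_hat
mae = np.mean(np.abs(errors))         # average absolute deviation, in grade points
rmse = np.sqrt(np.mean(errors ** 2))  # penalizes large errors more strongly

print("MAE:", round(mae, 3), "RMSE:", round(rmse, 3))
```

On the full test set the same quantities could be obtained by passing `y_test` and `y_pred` to the corresponding functions in `sklearn.metrics`.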
