<a href="https://colab.research.google.com/github/Dr-Carlos-Villasenor/PatternRecognition/blob/main/PR08_01_CrossVal_HyperSearch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pattern Recognition
## Dr. Carlos Villaseñor
## Cross-Validation and Hyperparameter Search



1. Import the required libraries

In [1]:
import numpy as np
import pandas as pd

from sklearn import tree
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import train_test_split

2. Load the dataset

In [None]:
!wget https://raw.githubusercontent.com/Dr-Carlos-Villasenor/PatternRecognition/main/Dataset/loan_prediction.csv

In [3]:
df = pd.read_csv('loan_prediction.csv')

In [None]:
df.head()

3. Select the features and the target, then split into training and test sets

In [5]:
x = np.asanyarray(df.iloc[:,0:-1])
y = np.asanyarray(df.iloc[:,-1])
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.25, random_state=0)

In [None]:
print('x:', x.shape, 'y:',y.shape)
print('xtrain:', xtrain.shape, 'ytrain:',ytrain.shape)
print('xtest:', xtest.shape, 'ytest:',ytest.shape)

4. Create the cross-validation splits using KFold

In [None]:
# KFold was already imported in step 1.
kf = KFold(n_splits=5)

train_scores = []
dev_scores = []

# Each iteration holds out one fold of xtrain as the dev (validation) set.
for train_index, test_index in kf.split(xtrain):
  train, dev = xtrain[train_index], xtrain[test_index]
  y_train, y_dev = ytrain[train_index], ytrain[test_index]
  # Fit a fresh tree on the remaining folds.
  model = tree.DecisionTreeClassifier()
  model.fit(train, y_train)
  # score() returns accuracy for classifiers.
  train_scores.append(model.score(train,y_train))
  dev_scores.append(model.score(dev, y_dev))


print(train_scores)
print(np.mean(train_scores))
print(dev_scores)
print(np.mean(dev_scores))
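The manual loop above can also be written with `cross_validate`, which returns both training and validation scores when `return_train_score=True`. The following is a minimal sketch on synthetic data: `X_demo` and `y_demo` come from `make_classification` and are stand-ins, not the loan dataset, and `random_state=0` is fixed only to make the run repeatable.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data (assumption): not the loan dataset used in this notebook.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

model = tree.DecisionTreeClassifier(random_state=0)

# cross_validate clones and fits the model once per fold, as the manual loop does.
results = cross_validate(model, X_demo, y_demo,
                         cv=KFold(n_splits=5), return_train_score=True)

train_scores = results['train_score']
dev_scores = results['test_score']  # scores on the held-out fold

print('train:', train_scores, 'mean:', np.mean(train_scores))
print('dev:  ', dev_scores, 'mean:', np.mean(dev_scores))
```

A large gap between the training and dev means usually indicates overfitting, which is one reason to tune hyperparameters such as `max_depth`.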

Another variant is the stratified version, which keeps the class proportions balanced in every fold

In [None]:
skf = StratifiedKFold(n_splits=5)

for i, (train_index, test_index) in enumerate(skf.split(xtrain, ytrain)):
  print(f"Fold {i}:")
  print(f"  Train: index={train_index}")
  print(f"  Test:  index={test_index}")
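To see what stratification changes, the sketch below compares the fraction of positive samples in each validation fold for `KFold` and `StratifiedKFold`. The labels are a deliberately ordered synthetic array (an assumption for illustration), and `positive_rate_per_fold` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic, ordered labels (assumption): 80 samples of class 0, then 20 of class 1.
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

def positive_rate_per_fold(splitter):
    """Hypothetical helper: fraction of class-1 samples in each validation fold."""
    return [float(y_demo[dev_idx].mean())
            for _, dev_idx in splitter.split(X_demo, y_demo)]

kfold_rates = positive_rate_per_fold(KFold(n_splits=5))
strat_rates = positive_rate_per_fold(StratifiedKFold(n_splits=5))

print('KFold:          ', kfold_rates)  # four folds with no positives, one with only positives
print('StratifiedKFold:', strat_rates)  # every fold keeps the overall 20% positive rate
```

Without shuffling, `KFold` cuts the data into consecutive blocks, so ordered labels can leave whole folds with a single class. `StratifiedKFold` avoids this by splitting each class separately.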

4. Train using cross-validation

In [None]:
model = tree.DecisionTreeClassifier()
scores = cross_val_score(model, xtrain, ytrain, cv=5, scoring='f1_macro')
print(scores)

5. Compute the mean as the final performance metric

In [10]:
scores.mean()

0.6206294454799657
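When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses `StratifiedKFold` internally. The sketch below checks this on synthetic data and also reports the standard deviation next to the mean, which shows how much the score varies across folds. `X_demo`, `y_demo` and the fixed `random_state=0` are assumptions for the illustration, not part of the notebook.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption): not the loan dataset used in this notebook.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# A fixed random_state makes the tree deterministic, so both runs are comparable.
model = tree.DecisionTreeClassifier(random_state=0)

scores_int = cross_val_score(model, X_demo, y_demo, cv=5, scoring='f1_macro')
scores_skf = cross_val_score(model, X_demo, y_demo,
                             cv=StratifiedKFold(n_splits=5), scoring='f1_macro')

print(scores_int)
print(scores_skf)
print(f'f1_macro: {scores_int.mean():.3f} +/- {scores_int.std():.3f}')
```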

6. For the hyperparameter search, create the following dictionary

In [11]:
parameters = {'max_depth':[1,2,3,4,5],
              'min_samples_leaf':[1,2,3,4,5],
              'min_samples_split':[2,3,4,5],
              'criterion' : ['gini','entropy']}
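Before running a search it helps to know how many combinations the dictionary defines. As a small sketch, `ParameterGrid` enumerates them; the fold count of 5 matches the `cv=5` used elsewhere in this notebook.

```python
from sklearn.model_selection import ParameterGrid

parameters = {'max_depth': [1, 2, 3, 4, 5],
              'min_samples_leaf': [1, 2, 3, 4, 5],
              'min_samples_split': [2, 3, 4, 5],
              'criterion': ['gini', 'entropy']}

grid = ParameterGrid(parameters)
n_candidates = len(grid)      # 5 * 5 * 4 * 2 = 200 combinations
n_fits = n_candidates * 5     # one fit per candidate per fold with cv=5

print(n_candidates, n_fits)   # 200 1000
print(grid[0])                # one candidate, as a dict of parameter values
```

An exhaustive grid search over this dictionary therefore fits 1000 trees, plus one final refit, while a randomized search with `n_iter=10` fits only 50 plus the refit.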

7. Run the search. The active line uses RandomizedSearchCV; the commented-out line is the exhaustive GridSearchCV alternative

In [None]:
model = tree.DecisionTreeClassifier()
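# Exhaustive alternative: evaluates every combination in `parameters`.
#search_obj = GridSearchCV(model, parameters, cv=5, scoring='f1_macro')
# Randomized search: evaluates n_iter=10 combinations sampled from `parameters`.
search_obj = RandomizedSearchCV(model, parameters, n_iter=10, cv=5, scoring='f1_macro')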
fit_obj = search_obj.fit(xtrain, ytrain)
print(fit_obj.cv_results_['mean_test_score'])
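The two search strategies can be compared side by side. The following sketch runs both on synthetic data and reads `best_params_` and `best_score_` from each fitted object. The data, the variable names `rand_search` and `grid_search`, and the fixed `random_state=0` values are assumptions added for repeatability.

```python
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in data (assumption): not the loan dataset used in this notebook.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

parameters = {'max_depth': [1, 2, 3, 4, 5],
              'min_samples_leaf': [1, 2, 3, 4, 5],
              'min_samples_split': [2, 3, 4, 5],
              'criterion': ['gini', 'entropy']}

model = tree.DecisionTreeClassifier(random_state=0)

# Randomized search: samples 10 of the 200 combinations.
rand_search = RandomizedSearchCV(model, parameters, n_iter=10, cv=5,
                                 scoring='f1_macro', random_state=0)
rand_search.fit(X_demo, y_demo)

# Grid search: evaluates all 200 combinations.
grid_search = GridSearchCV(model, parameters, cv=5, scoring='f1_macro')
grid_search.fit(X_demo, y_demo)

print('Randomized:', len(rand_search.cv_results_['params']), 'candidates,',
      'best', rand_search.best_params_, rand_search.best_score_)
print('Grid:      ', len(grid_search.cv_results_['params']), 'candidates,',
      'best', grid_search.best_params_, grid_search.best_score_)
```

The randomized search is cheaper but may miss the best combination. The grid search is exhaustive over the listed values, at 20 times the cost in this setup.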

8. Print the best model

In [None]:
best_model = fit_obj.best_estimator_
print(best_model)

9. Retrain the best model on all the training data

In [None]:
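# With the default refit=True, best_estimator_ has already been refit on all of xtrain;
# fitting again here only makes that step explicit.
best_model.fit(xtrain, ytrain)
# Report accuracy on the training and test sets.
print('Train: ', best_model.score(xtrain, ytrain))
print('Test: ', best_model.score(xtest, ytest))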
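Note that `best_model.score` reports accuracy, while the search was scored with `f1_macro`. To compare the test result with the cross-validation score on the same metric, one option is to compute `f1_score` with `average='macro'` on the test set. The sketch below does this on synthetic data; the data, the reduced grid and the names `xtr`, `xte`, `ytr`, `yte` are assumptions for the illustration.

```python
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): not the loan dataset used in this notebook.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
xtr, xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Reduced grid (assumption) to keep the example short.
search = GridSearchCV(tree.DecisionTreeClassifier(random_state=0),
                      {'max_depth': [1, 2, 3, 4, 5]}, cv=5, scoring='f1_macro')
search.fit(xtr, ytr)

# best_estimator_ is already refit on all of xtr because refit=True by default.
best_model = search.best_estimator_
test_f1 = f1_score(yte, best_model.predict(xte), average='macro')

print('CV f1_macro:  ', search.best_score_)
print('Test f1_macro:', test_f1)
# search.score applies the same f1_macro scorer to the refit best model.
print('search.score: ', search.score(xte, yte))
```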