<a href="https://colab.research.google.com/github/TomasPastore/aprendizaje_automatico_tp/blob/main/tp_AA_grupo_6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import pandas as pd
import timeit
from sklearn.metrics import roc_auc_score

# Reading the dataset

In [None]:
url = 'https://raw.githubusercontent.com/aprendizaje-automatico-dc-uba-ar/material/main/tp/01_aprendizaje_supervisado/datos/minions_publicos.csv'
minions_dataset = pd.read_csv(url)

y = minions_dataset.iloc[:,-1:]
print(y[y["target"] == 1])

del minions_dataset[minions_dataset.columns[-1]]
print("\n Features")
minions_dataset

This tells us that there are 154 positive-class instances and the rest are negative, i.e. the dataset is imbalanced at roughly 30/70. We need to keep this in mind when splitting the data.
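
The class proportions can be checked directly from the label counts. The following is a minimal sketch, not part of the original notebook: it assumes a hypothetical label list with the 154 positives reported above out of the 500 applicants in the dataset, and `class_proportions` is an illustrative helper name.

```python
from collections import Counter

def class_proportions(labels):
    """Return the fraction of samples that belongs to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

# Hypothetical labels: 154 positives and 346 negatives (500 samples in total)
labels = [1] * 154 + [0] * 346
props = class_proportions(labels)
print(props)
```

On the real data, `y["target"].value_counts(normalize=True)` should give the same information in one call.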

# 1) Data split

To set aside the evaluation data, the first thing we considered was the label distribution of the interviewed minions. We found an imbalance between accepted and rejected minions (roughly 30% of the applicants were accepted). It is therefore essential to preserve the ratio between the two classes when building the _k-folds_; otherwise the predictor would not be trained on the correct class proportions.

Second, we have to consider how many candidates we have. With only 500 applicants the dataset is rather small, so we have little room to maneuver when choosing a splitting strategy. Our group proposes holding out 10% for the final evaluation stage and keeping the remaining 90% for model development (using a _stratified split_ throughout so that the class imbalance does not hurt us later).
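
As a rough illustration of what the stratified 90/10 split does, the sketch below uses synthetic stand-in data (random features, 154 positives out of 500) rather than the real dataset; all `*_demo` names are hypothetical. It only shows that passing `stratify` keeps the positive rate nearly identical in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 500 rows, 154 positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = np.array([1] * 154 + [0] * 346)

# stratify=y_demo asks the splitter to keep the class ratio in both parts
X_dev_demo, X_eval_demo, y_dev_demo, y_eval_demo = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=4
)

print("positive rate overall:", y_demo.mean())
print("positive rate in dev: ", y_dev_demo.mean())
print("positive rate in eval:", y_eval_demo.mean())
```

Without `stratify`, the split is purely random and the positive rate of a 50-sample evaluation set can drift noticeably from 30%.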

#########################
To review together: `train_test_split` picks the samples at random, and the class proportions are only preserved when `stratify=y` is passed. Should we hold out more data for eval?

In [None]:
from sklearn.model_selection import train_test_split

# stratify=y keeps the ~30/70 class ratio in both parts, as described above
X_dev, X_eval, y_dev, y_eval = train_test_split(minions_dataset, y, shuffle=True, stratify=y, random_state=4, test_size=0.1) # leaves about 15 positives in eval

print(f"X_dev shape: {X_dev.shape}")
print(f"y_dev shape: {y_dev.shape}")

print(f"X_eval shape: {X_eval.shape}")
print(f"y_eval shape: {y_eval.shape}")


# 2) Model building

## 2.1) Default tree with max_depth 3

In [None]:
from sklearn.tree import DecisionTreeClassifier

arbol_gini_3 = DecisionTreeClassifier(max_depth=3)

## 2.2) Cross-validation iterators

In [None]:

from sklearn.model_selection import StratifiedShuffleSplit, StratifiedKFold

# We asked and we may use StratifiedKFold; I would drop the other iterator to keep things simple, though we could eventually compare them.
# Evaluate how much it pays off to train on balanced data vs. the real class proportions.
balanced_k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Check whether we may use this, since the assignment asks for k-fold
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=4)

X_train_cv = []
y_train_cv = []
X_test_cv = []
y_test_cv = []

for train_index, test_index in sss.split(X_dev, y_dev):
    X_train = X_dev.iloc[train_index]
    X_test = X_dev.iloc[test_index]
    y_train = y_dev.iloc[train_index]
    y_test = y_dev.iloc[test_index]
    
    X_train_cv.append(X_train)
    X_test_cv.append(X_test)
    y_train_cv.append(y_train)
    y_test_cv.append(y_test)
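
The two iterators above differ in one respect that matters for the "k-fold" requirement: `StratifiedKFold` places every sample in exactly one test fold, whereas `StratifiedShuffleSplit` draws each test set independently, so a sample may be held out several times or never. A small sketch on toy labels (hypothetical `y_toy`, `X_toy` and `held_out_counts`, not the notebook's data) makes this visible:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

# Toy labels with a 30/70 imbalance; the features are irrelevant for the split itself
y_toy = np.array([1] * 15 + [0] * 35)
X_toy = np.zeros((len(y_toy), 1))

def held_out_counts(cv):
    """Count how many times each sample lands in a test set."""
    counts = np.zeros(len(y_toy), dtype=int)
    for _, test_idx in cv.split(X_toy, y_toy):
        counts[test_idx] += 1
    return counts

kfold_counts = held_out_counts(StratifiedKFold(n_splits=5, shuffle=True, random_state=42))
shuffle_counts = held_out_counts(StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=4))

print("StratifiedKFold, times held out per sample:       ", np.unique(kfold_counts))
print("StratifiedShuffleSplit, times held out per sample:", np.unique(shuffle_counts))
```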


## 2.2) Metrics

In [None]:
from sklearn import metrics

# Scoring metrics

# confusion_matrix rows are true labels and columns are predictions: [[tn, fp], [fn, tp]]
def tn(y, y_pred): return metrics.confusion_matrix(y, y_pred)[0, 0]
def fp(y, y_pred): return metrics.confusion_matrix(y, y_pred)[0, 1]
def fn(y, y_pred): return metrics.confusion_matrix(y, y_pred)[1, 0]
def tp(y, y_pred): return metrics.confusion_matrix(y, y_pred)[1, 1]
def specificity(y, y_pred): return tn(y, y_pred) / (tn(y, y_pred) + fp(y, y_pred))
def precision(y, y_pred): return tp(y, y_pred) / (tp(y, y_pred) + fp(y, y_pred))
def recall(y, y_pred): return tp(y, y_pred) / (tp(y, y_pred) + fn(y, y_pred))
def f1(y, y_pred): return 2 * precision(y, y_pred) * recall(y, y_pred) / (precision(y, y_pred) + recall(y, y_pred))
def accuracy(y, y_pred): return metrics.accuracy_score(y, y_pred) # probably removable; check that it is unused

scoring = {#'precision': metrics.make_scorer(precision), 
           #'recall': metrics.make_scorer(recall),
           #'specificity': metrics.make_scorer(specificity),
           #'tp': metrics.make_scorer(tp),
           #'tn': metrics.make_scorer(tn),
           #'fp': metrics.make_scorer(fp),
           #'fn': metrics.make_scorer(fn),
           #'f1_score': metrics.make_scorer(f1),
           'roc_auc': metrics.make_scorer(metrics.roc_auc_score, needs_proba=True),
           'accuracy': metrics.make_scorer(metrics.accuracy_score)
           }
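
The metric helpers defined in this cell can be sanity-checked by hand on a tiny example. The sketch below recomputes them in plain Python from the four confusion counts; the label lists and the `confusion_counts` / `basic_scores` names are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp

def basic_scores(y_true, y_pred):
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "specificity": tn / (tn + fp),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# 3 positives and 5 negatives; one positive is missed and one negative is flagged
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 1, 0, 0, 0, 0]
demo_scores = basic_scores(y_true_demo, y_pred_demo)
print(demo_scores)
```

Here tp = 2, fn = 1, fp = 1 and tn = 4, so precision, recall and F1 are all 2/3 and specificity is 4/5.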


## 2.2) Default tree (gini), depth 3 + cross-validation K=5 + accuracy and AUC

We combine the 3 previous cells to run the folds and compute the metrics for a classifier.

In [None]:
from sklearn import preprocessing
from sklearn.pipeline import make_pipeline

# Crossvalidation with one fixed configuration of hyperparameters.
# scoring is a dictionary with the metrics we want to calculate
# cv is a cross validation iterator

def cross_validate_uba(clf, X, y, scoring, cv, score_train=True):
    # `scoring` is kept in the signature; accuracy and AUC are computed directly below.
    # Pooling the test predictions assumes a k-fold iterator (each sample is held out exactly once).
    # Work on positional NumPy arrays so the fold indices can index them directly
    X = np.asarray(X)
    y = np.asarray(y).ravel()
    n = len(y)
    k = cv.get_n_splits(X, y)

    # Each sample is scored on train in k-1 folds; NaN marks the fold where it was held out
    train_global_preds = np.full((n, k), np.nan)
    train_global_probas = np.full((n, k), np.nan)  # probability of the POSITIVE class only
    test_global_preds = np.zeros(n)
    test_global_probas = np.zeros(n)

    cv_results = dict(train_global_preds=train_global_preds,
                      train_global_probas=train_global_probas,
                      train_folds_accuracy=[],
                      train_folds_auc=[],
                      test_global_preds=test_global_preds,
                      test_global_probas=test_global_probas,
                      test_folds_accuracy=[],
                      test_folds_auc=[],
                      )

    for fold_idx, (train_idxs, test_idxs) in enumerate(cv.split(X, y)):
        X_train, y_train = X[train_idxs], y[train_idxs]
        X_test, y_test = X[test_idxs], y[test_idxs]

        clf.fit(X_train, y_train)

        if score_train:
            # Scores on train
            y_pred_train = clf.predict(X_train)
            y_proba_train = clf.predict_proba(X_train)[:, 1]
            train_global_preds[train_idxs, fold_idx] = y_pred_train
            train_global_probas[train_idxs, fold_idx] = y_proba_train
            cv_results["train_folds_accuracy"].append(metrics.accuracy_score(y_train, y_pred_train))
            cv_results["train_folds_auc"].append(metrics.roc_auc_score(y_train, y_proba_train))

        # Scores on test
        y_pred_test = clf.predict(X_test)
        y_proba_test = clf.predict_proba(X_test)[:, 1]
        test_global_preds[test_idxs] = y_pred_test
        test_global_probas[test_idxs] = y_proba_test
        cv_results["test_folds_accuracy"].append(metrics.accuracy_score(y_test, y_pred_test))
        cv_results["test_folds_auc"].append(metrics.roc_auc_score(y_test, y_proba_test))

    # Global test metrics pool the out-of-fold predictions of all samples
    cv_results["test_global_accuracy"] = metrics.accuracy_score(y, test_global_preds)
    cv_results["test_global_auc"] = metrics.roc_auc_score(y, test_global_probas)

    if score_train:
        # Global train metrics pool every (sample, fold) pair in which the sample was used for training
        seen = ~np.isnan(train_global_preds)
        y_rep = np.repeat(y[:, None], k, axis=1)
        cv_results["train_global_accuracy"] = metrics.accuracy_score(y_rep[seen], train_global_preds[seen])
        cv_results["train_global_auc"] = metrics.roc_auc_score(y_rep[seen], train_global_probas[seen])

    return cv_results


def print_metric_summary(name, split, global_value, fold_values):
    # Print the global value of a metric plus summary statistics over the folds
    fold_values = np.asarray(fold_values, dtype=float)
    print(f"{name} [{split}]:")
    print(f"\tGlobal --> {global_value:.3f}")
    print("\tFolds:")
    print(f"\t\tRaw --> {[round(float(v), 3) for v in fold_values]}")
    print(f"\t\tMean --> {np.mean(fold_values):.3f}")
    print(f"\t\tSTD --> {np.std(fold_values):.3f}")
    print(f"\t\tMedian --> {np.median(fold_values):.3f}")


def cv_with_metrics(classifier_to_train, X, y, scoring, cv, score_train=False):
    # Distance- and margin-based models are sensitive to feature scale, so they get a StandardScaler.
    # Estimator instances have no __name__, so we match on the class name instead.
    models_to_standarize = {'KNeighborsClassifier', 'SVC'}
    if type(classifier_to_train).__name__ in models_to_standarize:
        clf = make_pipeline(preprocessing.StandardScaler(), classifier_to_train)
    else:
        clf = classifier_to_train

    cv_results = cross_validate_uba(clf, X, y=y, scoring=scoring, cv=cv, score_train=score_train)

    print(f'Crossvalidation metrics for {clf}...\n')

    # See whether repeated k-fold makes sense to compute confidence intervals for the metrics

    for metric_name, key in (("Accuracy", "accuracy"), ("AUC", "auc")):
        if score_train:
            print_metric_summary(metric_name, "TRAIN", cv_results[f"train_global_{key}"], cv_results[f"train_folds_{key}"])
        print_metric_summary(metric_name, "TEST", cv_results[f"test_global_{key}"], cv_results[f"test_folds_{key}"])


In [None]:
arr = np.zeros((10, 5), dtype=np.ndarray)

arr[[0,2,4,6,8],3] = [1,2,3,4,5]
arr



In [None]:
# Run k-fold cross-validation for one fixed configuration
cv_with_metrics(arbol_gini_3, X_dev, y_dev, scoring, cv=balanced_k_fold, score_train=True)

## Results

<table>
      <thead>
      <tr>
      <th align="center">Permutation</th>
      <th>Accuracy (training)</th>
      <th>Accuracy (validation)</th>
      <th>ROC AUC (training)</th>
      <th>ROC AUC (validation)</th>
      </tr>
      </thead>
      <tbody>
      <tr>
      <td align="center">1</td>
      <td>0.769</td>
      <td>0.722</td>
      <td>0.815</td>
      <td>0.759</td>
      </tr>
      <tr>
      <td align="center">2</td>
      <td>0.806</td>
      <td>0.667</td>
      <td>0.774</td>
      <td>0.622</td>
      </tr>
      <tr>
      <td align="center">3</td>
      <td>0.841</td>
      <td>0.689</td>
      <td>0.837</td>
      <td>0.614</td>
      </tr>
      <tr>
      <td align="center">4</td>
      <td>0.786</td>
      <td>0.6</td>
      <td>0.757</td>
      <td>0.493</td>
      </tr>
      <tr>
      <td align="center">5</td>
      <td>0.839</td>
      <td>0.711</td>
      <td>0.841</td>
      <td>0.655</td>
      </tr>
      <tr>
      <td align="center">Global</td>
      <td>0.808</td>
      <td>0.678</td>
      <td>0.817</td>
      <td>0.629</td>
      </tr>
      </tbody>
      </table>

## 2.3) Parameter grid over the tree

In [None]:
from sklearn.model_selection import ParameterGrid

def parameter_grid_search(classifier, grid):
    param_grid = ParameterGrid(grid)
    for config in param_grid:
        classifier.set_params(**config)
        cv_with_metrics(classifier, X_dev, y_dev, scoring, cv=balanced_k_fold)
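
To see what the loop above iterates over, the following sketch expands a small grid with `ParameterGrid`. The `demo_grid` values mirror the tree grid used in this section, but the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Every combination of the listed values becomes one configuration
demo_grid = {"criterion": ["gini", "entropy"], "max_depth": [None, 3, 5]}
configs = list(ParameterGrid(demo_grid))

for config in configs:
    print(config)
print("number of configurations:", len(configs))
```

Each `config` is a plain dict, which is why it can be unpacked into `classifier.set_params(**config)`.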


In [None]:
grid_arbol = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 3, 5]
}
parameter_grid_search(classifier=arbol_gini_3, grid=grid_arbol)

## Results

<table>
   <thead>
   <tr>
   <th align="center">Max depth</th>
   <th align="center">Split criterion</th>
   <th>Accuracy (training)</th>
   <th>Accuracy (validation)</th>
   </tr>
   </thead>
   <tbody><tr>
   <td align="center">3</td>
   <td align="center">Gini</td>
   <td>0.8</td>
   <td>0.67</td>
   </tr>
   <tr>
   <td align="center">5</td>
   <td align="center">Gini</td>
   <td>0.91</td>
   <td>0.67</td>
   </tr>
   <tr>
   <td align="center">Unlimited</td>
   <td align="center">Gini</td>
   <td>1.0</td>
   <td>0.67</td>
   </tr>
   <tr>
   <td align="center">3</td>
   <td align="center">Entropy</td>
   <td>0.78</td>
   <td>0.64</td>
   </tr>
   <tr>
   <td align="center">5</td>
   <td align="center">Entropy</td>
   <td>0.89</td>
   <td>0.64</td>
   </tr>
   <tr>
   <td align="center">Unlimited</td>
   <td align="center">Entropy</td>
   <td>1.0</td>
   <td>0.63</td>
   </tr>
   </tbody></table>

# 2.2 and 2.3) v0

In [None]:
arboles_cv = []
y_pred_cv = []
acc_test_sum = 0
auc_test_sum = 0
acc_train_sum = 0
auc_train_sum = 0

for i in range(5):
  arbol = DecisionTreeClassifier(max_depth=3)
  arbol.fit(X_train_cv[i], y_train_cv[i])
  arboles_cv.append(arbol)

  # Predict labels for the held-out instances
  y_pred = arbol.predict(X_test_cv[i])
  y_pred_cv.append(y_pred)

  # roc_auc_score expects (y_true, y_score); the score is the predicted probability of the positive class
  proba_train = arbol.predict_proba(X_train_cv[i])[:, 1]
  proba_test = arbol.predict_proba(X_test_cv[i])[:, 1]

  acc_train = arbol.score(X_train_cv[i], y_train_cv[i])
  print(f"Accuracy on the train set: {acc_train}")
  auc_train = roc_auc_score(y_train_cv[i], proba_train)
  print("ROC AUC on the train set:", auc_train)

  acc_test = arbol.score(X_test_cv[i], y_test_cv[i])
  print(f"Accuracy on the test set: {acc_test}")
  auc_test = roc_auc_score(y_test_cv[i], proba_test)
  print("ROC AUC on the test set:", auc_test)

  acc_test_sum += acc_test
  auc_test_sum += auc_test
  acc_train_sum += acc_train
  auc_train_sum += auc_train

print("mean test accuracy:", acc_test_sum/5)
print("mean test AUC:", auc_test_sum/5)
print("mean train accuracy:", acc_train_sum/5)
print("mean train AUC:", auc_train_sum/5)



**Results**

<table>
      <thead>
      <tr>
      <th align="center">Permutation</th>
      <th>Accuracy (training)</th>
      <th>Accuracy (validation)</th>
      <th>ROC AUC (training)</th>
      <th>ROC AUC (validation)</th>
      </tr>
      </thead>
      <tbody>
      <tr>
      <td align="center">1</td>
      <td>0.769</td>
      <td>0.722</td>
      <td>0.815</td>
      <td>0.759</td>
      </tr>
      <tr>
      <td align="center">2</td>
      <td>0.806</td>
      <td>0.667</td>
      <td>0.774</td>
      <td>0.622</td>
      </tr>
      <tr>
      <td align="center">3</td>
      <td>0.841</td>
      <td>0.689</td>
      <td>0.837</td>
      <td>0.614</td>
      </tr>
      <tr>
      <td align="center">4</td>
      <td>0.786</td>
      <td>0.6</td>
      <td>0.757</td>
      <td>0.493</td>
      </tr>
      <tr>
      <td align="center">5</td>
      <td>0.839</td>
      <td>0.711</td>
      <td>0.841</td>
      <td>0.655</td>
      </tr>
      <tr>
      <td align="center">Global</td>
      <td>0.808</td>
      <td>0.678</td>
      <td>0.817</td>
      <td>0.629</td>
      </tr>
      </tbody>
      </table>

In [None]:
arboles_cv = []
y_pred_cv = []
acc_test_sum = 0
acc_train_sum = 0

for i in range(5):
  arbol = DecisionTreeClassifier(max_depth=5)
  arbol.fit(X_train_cv[i], y_train_cv[i])
  arboles_cv.append(arbol)

  # Predict labels for the held-out instances
  y_pred = arbol.predict(X_test_cv[i])
  y_pred_cv.append(y_pred)

  print(f"Accuracy on the test set: {arbol.score(X_test_cv[i], y_test_cv[i])}")

  acc_test_sum += arbol.score(X_test_cv[i], y_test_cv[i])
  acc_train_sum += arbol.score(X_train_cv[i], y_train_cv[i])

print("mean test accuracy:", acc_test_sum/5)
print("mean train accuracy:", acc_train_sum/5)

In [None]:
arboles_cv = []
y_pred_cv = []
acc_test_sum = 0
acc_train_sum = 0

for i in range(5):
  arbol = DecisionTreeClassifier()
  arbol.fit(X_train_cv[i], y_train_cv[i])
  arboles_cv.append(arbol)

  # Predict labels for the held-out instances
  y_pred = arbol.predict(X_test_cv[i])
  y_pred_cv.append(y_pred)

  print(f"Accuracy on the test set: {arbol.score(X_test_cv[i], y_test_cv[i])}")

  acc_test_sum += arbol.score(X_test_cv[i], y_test_cv[i])
  acc_train_sum += arbol.score(X_train_cv[i], y_train_cv[i])

print("mean test accuracy:", acc_test_sum/5)
print("mean train accuracy:", acc_train_sum/5)

In [None]:
arboles_cv = []
y_pred_cv = []
acc_test_sum = 0
acc_train_sum = 0

for i in range(5):
  arbol = DecisionTreeClassifier(criterion='entropy', max_depth=3)
  arbol.fit(X_train_cv[i], y_train_cv[i])
  arboles_cv.append(arbol)

  # Predict labels for the held-out instances
  y_pred = arbol.predict(X_test_cv[i])
  y_pred_cv.append(y_pred)

  print(f"Accuracy on the test set: {arbol.score(X_test_cv[i], y_test_cv[i])}")

  acc_test_sum += arbol.score(X_test_cv[i], y_test_cv[i])
  acc_train_sum += arbol.score(X_train_cv[i], y_train_cv[i])

print("mean test accuracy:", acc_test_sum/5)
print("mean train accuracy:", acc_train_sum/5)

In [None]:
arboles_cv = []
y_pred_cv = []
acc_test_sum = 0
acc_train_sum = 0

for i in range(5):
  arbol = DecisionTreeClassifier(criterion='entropy', max_depth=5)
  arbol.fit(X_train_cv[i], y_train_cv[i])
  arboles_cv.append(arbol)

  # Predict labels for the held-out instances
  y_pred = arbol.predict(X_test_cv[i])
  y_pred_cv.append(y_pred)

  print(f"Accuracy on the test set: {arbol.score(X_test_cv[i], y_test_cv[i])}")

  acc_test_sum += arbol.score(X_test_cv[i], y_test_cv[i])
  acc_train_sum += arbol.score(X_train_cv[i], y_train_cv[i])

print("mean test accuracy:", acc_test_sum/5)
print("mean train accuracy:", acc_train_sum/5)

In [None]:
arboles_cv = []
y_pred_cv = []
acc_test_sum = 0
acc_train_sum = 0

for i in range(5):
  arbol = DecisionTreeClassifier(criterion='entropy')
  arbol.fit(X_train_cv[i], y_train_cv[i])
  arboles_cv.append(arbol)

  # Predict labels for the held-out instances
  y_pred = arbol.predict(X_test_cv[i])
  y_pred_cv.append(y_pred)

  print(f"Accuracy on the test set: {arbol.score(X_test_cv[i], y_test_cv[i])}")

  acc_test_sum += arbol.score(X_test_cv[i], y_test_cv[i])
  acc_train_sum += arbol.score(X_train_cv[i], y_train_cv[i])

print("mean test accuracy:", acc_test_sum/5)
print("mean train accuracy:", acc_train_sum/5)

<table>
   <thead>
   <tr>
   <th align="center">Max depth</th>
   <th align="center">Split criterion</th>
   <th>Accuracy (training)</th>
   <th>Accuracy (validation)</th>
   </tr>
   </thead>
   <tbody><tr>
   <td align="center">3</td>
   <td align="center">Gini</td>
   <td>0.8</td>
   <td>0.67</td>
   </tr>
   <tr>
   <td align="center">5</td>
   <td align="center">Gini</td>
   <td>0.91</td>
   <td>0.67</td>
   </tr>
   <tr>
   <td align="center">Unlimited</td>
   <td align="center">Gini</td>
   <td>1.0</td>
   <td>0.67</td>
   </tr>
   <tr>
   <td align="center">3</td>
   <td align="center">Entropy</td>
   <td>0.78</td>
   <td>0.64</td>
   </tr>
   <tr>
   <td align="center">5</td>
   <td align="center">Entropy</td>
   <td>0.89</td>
   <td>0.64</td>
   </tr>
   <tr>
   <td align="center">Unlimited</td>
   <td align="center">Entropy</td>
   <td>1.0</td>
   <td>0.63</td>
   </tr>
   </tbody></table>

Conclusions (to check whether the metrics are computed correctly; I feel they should come out differently): increasing the maximum depth of the trees does not translate into better performance at validation time. With unlimited depth the trees keep splitting until every training instance is classified correctly, which is why train accuracy rises with depth until it reaches 1.0. This is a clear case of overfitting: despite the substantial gain in train accuracy, validation accuracy does not improve (it stays at 0.67 for Gini) and even drops slightly (from 0.64 to 0.63 for entropy).
In short, in these results increasing the maximum depth of the trees is not only more expensive computationally but also fails to improve, or slightly worsens, validation performance.
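
The same pattern can be reproduced on synthetic data. The sketch below is only an illustration under assumptions (a `make_classification` dataset with 10% label noise and a 70/30 class ratio, not the minions data; `X_syn`, `y_syn` and `depth_scores` are made-up names). It compares train and validation accuracy for the three depths with scikit-learn's `cross_validate`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy, imbalanced data standing in for the real dataset
X_syn, y_syn = make_classification(
    n_samples=400, n_features=20, n_informative=5,
    weights=[0.7, 0.3], flip_y=0.1, random_state=0,
)
cv_syn = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

depth_scores = {}
for depth in [3, 5, None]:
    res = cross_validate(
        DecisionTreeClassifier(max_depth=depth, random_state=0),
        X_syn, y_syn, cv=cv_syn, scoring="accuracy", return_train_score=True,
    )
    depth_scores[depth] = (res["train_score"].mean(), res["test_score"].mean())
    print(f"max_depth={depth}: train={depth_scores[depth][0]:.3f}, "
          f"validation={depth_scores[depth][1]:.3f}")
```

An unrestricted tree memorizes the training folds, so its train accuracy is perfect while its validation accuracy stays well below that.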

# 3) Algorithm comparison with RandomizedSearchCV

Algorithms to try:

* Decision trees (we just did this; do we search for better trees over hyperparameters? Yes, here we try more hyperparameters)
* KNN (k-nearest neighbors)
* SVM (support vector machine)
* LDA (linear discriminant analysis)
* Naïve Bayes

We can take the global test score to be the average over the folds (not strictly correct, but a reasonable approximation).
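
To see why averaging per-fold scores is only an approximation of the global score, consider AUC on two tiny hypothetical folds. Each fold ranks its own samples perfectly, but the scores are not comparable across folds, so the pooled AUC is lower. The `auc` helper below is a plain-Python pairwise implementation written for this illustration, and the numbers are invented.

```python
def auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Two hypothetical validation folds: (true labels, predicted probabilities)
folds = [
    ([0, 1], [0.20, 0.90]),
    ([0, 1], [0.95, 0.97]),
]

mean_fold_auc = sum(auc(y, s) for y, s in folds) / len(folds)

# Pool the out-of-fold predictions and score them once
pooled_y = [t for y, _ in folds for t in y]
pooled_scores = [v for _, s in folds for v in s]
pooled_auc = auc(pooled_y, pooled_scores)

print("mean of fold AUCs:", mean_fold_auc)
print("pooled AUC:       ", pooled_auc)
```

For accuracy with equal-sized folds the two numbers coincide; for ranking metrics such as AUC they generally do not.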

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

def fine_tune(X, y, classifier, model_name, grid, cv, scoring, objective):
    print('\nFine tuning {0}. Objective: {1}'.format(model_name, objective))

    pipeline_clf = Pipeline([
        ('scaler', preprocessing.StandardScaler()),
        ('clf', classifier)
        ])

    # Added standardization here; it seems to have made the results slightly worse, so decide whether to keep it
    random_search = RandomizedSearchCV(estimator=pipeline_clf, param_distributions=grid, n_jobs=-1, cv=cv, scoring=scoring, refit=objective)
    random_result = random_search.fit(X, y)
    print("Best score was : %f using %s" % (random_result.best_score_, random_result.best_params_))
    return random_result.best_estimator_
      

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier  
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

from scipy.stats import uniform, randint

# Initialized with defaults because the hyperparameters are varied later
classifiers_to_test = {
    'Árbol de decisión': DecisionTreeClassifier(),
    'KNN': KNeighborsClassifier(n_jobs=-1),
    'SVM' : SVC(probability=True),  # probability=True so that the roc_auc scorer can call predict_proba
    'LDA' : LDA(),
    'Naïve Bayes': GaussianNB()
}

grids = dict()

grids['Árbol de decisión'] = {
    'clf__criterion': ['gini', 'entropy', 'log_loss'],
    'clf__splitter': ['best', 'random'],
    'clf__max_depth': [None, 10, 50, 100, 150, 200],
    'clf__max_features': ['sqrt', 'log2', None]
}
grids['KNN'] = {
    'clf__n_neighbors': range(5, 26, 5),
    'clf__weights': ['uniform', 'distance'],
    'clf__leaf_size': randint(20, 50),
    'clf__p': [1, 2]
}
grids['SVM'] = {
    'clf__C': [0.5, 1, 2],
    'clf__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'clf__degree': [1, 2, 3]
}
grids['LDA'] = {
    'clf__solver': ['lsqr', 'eigen'],
    'clf__shrinkage': [None, 'auto', 0, 0.25, 0.5, 0.75, 1]
}
grids['Naïve Bayes'] = {
    'clf__var_smoothing': uniform(1e-9, 1e-7)
}

# TODO: look into these warnings instead of silencing them
import warnings
warnings.filterwarnings("ignore")

best_classifier = dict()
for clf_name, clf in classifiers_to_test.items():
    # fine_tune returns the estimator with the best combination among those tried; the dict stores the best config for each algorithm
    best_classifier[clf_name] = fine_tune(X_dev, y_dev, clf, clf_name, grids[clf_name], balanced_k_fold, scoring, objective='roc_auc')


In [None]:
best_classifier

# 4) Random Forest

Build a RandomForest model with 200 trees. Explore what the max_features hyperparameter is for and how it affects the algorithm's performance by means of a complexity curve. Explain why you think the results came out as they did. Finally, plot a learning curve for the chosen parameters to determine whether getting more data would be useful.

max_features = the number of features to consider when looking for the best split // 
n_estimators = number of trees
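
One possible way to obtain the complexity curve over `max_features` is scikit-learn's `validation_curve`. The sketch below is a starting point under assumptions, not the notebook's final experiment: it runs on synthetic data and uses 25 trees instead of 200 to keep it fast, and `X_syn`, `y_syn` and `max_features_range` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, validation_curve

# Synthetic data with 16 features, of which 4 are informative
X_syn, y_syn = make_classification(
    n_samples=300, n_features=16, n_informative=4, random_state=0
)
max_features_range = [1, 2, 4, 8, 16]

# One row per max_features value, one column per CV fold
train_scores, val_scores = validation_curve(
    RandomForestClassifier(n_estimators=25, random_state=0),
    X_syn, y_syn,
    param_name="max_features",
    param_range=max_features_range,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    scoring="roc_auc",
)

for m, tr, va in zip(max_features_range, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"max_features={m:2d}: train AUC={tr:.3f}, validation AUC={va:.3f}")
```

Plotting the two mean curves against `max_features_range` gives the complexity curve. Small values decorrelate the trees, while `max_features` equal to the number of features makes every split consider all of them, as in bagging.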

In [None]:
from sklearn.ensemble import RandomForestClassifier

randomforest = RandomForestClassifier(n_estimators = 200)
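
For the learning curve requested in this section, `learning_curve` from scikit-learn trains the model on growing subsets of the training folds. Again this is a hedged sketch on synthetic data with a reduced number of trees; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, learning_curve

X_syn, y_syn = make_classification(
    n_samples=300, n_features=16, n_informative=4, random_state=0
)

# train_sizes are fractions of the training part of each fold (200 samples here)
train_sizes, lc_train, lc_val = learning_curve(
    RandomForestClassifier(n_estimators=25, random_state=0),
    X_syn, y_syn,
    train_sizes=np.linspace(0.2, 1.0, 4),
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    scoring="roc_auc",
)

for n, tr, va in zip(train_sizes, lc_train.mean(axis=1), lc_val.mean(axis=1)):
    print(f"{n:3d} training samples: train AUC={tr:.3f}, validation AUC={va:.3f}")
```

If the validation curve is still rising at the largest training size, more data would probably help; if it has flattened, extra data is unlikely to change much.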

In [None]:
# may be useful at the end


import matplotlib.pyplot as plt

def youden(fpr, tpr, thresholds):
    optimal_idx = np.argmax(tpr - fpr)
    return thresholds[optimal_idx]

def validate_classifiers(classifiers, X_train, y_train, X_test, y_test):
    for name, clf in classifiers.items():
        print('Validation for {0}'.format(name))
        pipeline_clf = make_pipeline(preprocessing.StandardScaler(), clf)
        pipeline_clf.fit(X_train, y_train)
        metrics.RocCurveDisplay.from_estimator(pipeline_clf, X_test, y_test)  # plot_roc_curve was removed in scikit-learn 1.2
        positive_probas = pipeline_clf.predict_proba(X_test)[:, 1]
        fpr, tpr, thresholds = metrics.roc_curve(y_test, positive_probas)
        youden_score = youden(fpr, tpr, thresholds)
        print('Youden spot ---> {0}'.format(youden_score))
        plt.show()
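
The `youden` helper above selects the threshold that maximizes Youden's J statistic, J = TPR - FPR. A hand-checkable sketch with made-up ROC points follows; `youden_threshold` and the point lists are illustrative, written in plain Python so the arithmetic is easy to follow.

```python
def youden_threshold(fpr, tpr, thresholds):
    """Return (threshold, J) at the ROC point where J = TPR - FPR is largest."""
    j_scores = [t - f for f, t in zip(fpr, tpr)]
    best_idx = max(range(len(j_scores)), key=j_scores.__getitem__)
    return thresholds[best_idx], j_scores[best_idx]

# Hypothetical ROC points, ordered by decreasing threshold as roc_curve returns them
fpr_pts = [0.0, 0.0, 0.25, 0.5, 1.0]
tpr_pts = [0.0, 0.5, 1.0, 1.0, 1.0]
thr_pts = [float("inf"), 0.9, 0.6, 0.4, 0.1]

best_threshold, best_j = youden_threshold(fpr_pts, tpr_pts, thr_pts)
print("best threshold:", best_threshold, "with J =", best_j)
```

Here J takes the values 0, 0.5, 0.75, 0.5 and 0, so the selected threshold is 0.6.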