# Final Project Part II: Classification with Ensembles

![Logo Tec](img/LogoTec2.jpg)

## Data Science and Analytics (Group 10)
### Students:
* Armando Bringas Corpus (A01200230)
* Walter André Hauri Rosales (A01794237)

### Instructors:
* Dra. María de la Paz Rico Fernández
* Mtra. Victoria Guerrero Orozco

### Date: November 18, 2022

In [None]:
try:
    # Optional: apply a dark plotting theme when jupyterthemes is installed.
    from jupyterthemes import jtplot
    jtplot.style(theme='monokai', context='notebook', ticks=True, grid=False)
except ImportError:
    print('jupyterthemes is not installed; using the default plot style')

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.inspection import permutation_importance
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, classification_report, precision_recall_curve, PrecisionRecallDisplay, confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split, GridSearchCV, RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import label_binarize
from sklearn.tree import DecisionTreeClassifier

In [None]:
data = pd.read_csv("data/Datos_de_calidad_del_agua_de_sitios_de_monitoreo_de_aguas_superficiales_2020_limpio.csv", index_col=0)
data.head()

## Selecting the independent variables X and the dependent variable y (traffic-light rating)

In [None]:
y = data["SEMAFORO"]
X = data.drop(columns=["SEMAFORO"])

In [None]:
# Keep only the float64 columns as features.
X = X.select_dtypes(["float64"])

Convert the traffic-light label to label encoding, e.g., from ["class 1", "class 2", "class 3"] to [1, 2, 3]. Label encoding was already applied to the output variable during data cleaning.

In [None]:
colores_semaforo = {'Amarillo': 0, 'Rojo': 1, 'Verde': 2}  # color name -> integer code
y.unique()  # Converted to a categorical variable during data cleaning
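
As an illustration of the label-encoding step described above, here is a minimal sketch. It assumes the raw column held the color names used as keys in `colores_semaforo`; the sample values in `raw` are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Mapping used in this notebook: color name -> integer code.
colores_semaforo = {'Amarillo': 0, 'Rojo': 1, 'Verde': 2}

# Hypothetical raw labels, for illustration only.
raw = pd.Series(["Verde", "Rojo", "Amarillo", "Verde"], name="SEMAFORO")

# Label encoding: replace each color name with its integer code.
encoded = raw.map(colores_semaforo)

# Inverse mapping, useful for turning predictions back into color names.
codigo_a_color = {code: color for color, code in colores_semaforo.items()}
decoded = encoded.map(codigo_a_color)

print(encoded.tolist())
print(decoded.tolist())
```

Keeping the inverse dictionary around makes it easy to label plots and reports with color names instead of integer codes.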

### General analysis of feature importances using decision trees or random forest

In [None]:
modelo_DT = DecisionTreeClassifier()
clf = modelo_DT.fit(X, y)

fig, ax = plt.subplots()
ax.barh(X.columns, width=clf.feature_importances_)
ax.set_title("Most important variables using Decision Trees")
fig.tight_layout()
plt.show()

In [None]:
modelo_RF = RandomForestClassifier()
clf = modelo_RF.fit(X, y)

fig, ax = plt.subplots()
ax.barh(X.columns, width=clf.feature_importances_)
ax.set_title("Most important variables using Random Forest")
fig.tight_layout()
plt.show()

## Variable Importance

### First classifier on the selected variables

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=45, stratify=y)

In [None]:
forest = RandomForestClassifier(random_state=0)
forest.fit(X_train, y_train)
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
forest_importances = pd.Series(importances, index=X.columns)

fig, ax = plt.subplots()
forest_importances.plot.barh(yerr=std, ax=ax)
ax.set_title("Most important variables using MDI (Mean Decrease in Impurity)")
fig.tight_layout()
plt.show()

In [None]:
result = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=42, n_jobs=2
)

forest_importances = pd.Series(result.importances_mean, index=X.columns)

fig, ax = plt.subplots()
forest_importances.plot.barh(yerr=result.importances_std, ax=ax)
ax.set_title("Most important variables using permutation on the full model")
fig.tight_layout()
plt.show()

Select the most important variables.

Explore which classifier performs best.

Assess accuracy using the classification report and an analysis of the precision-recall curve.
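
The first step above, selecting the most important variables, could be sketched as follows. This is a hedged illustration on synthetic data: the column names `feat_0` to `feat_5`, the cutoff `top_k = 3`, and every `*_demo` variable are assumptions for the example, not values from this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the water-quality features.
X_arr, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(6)])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Fit a forest and rank features by permutation importance on held-out data.
forest_demo = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
perm = permutation_importance(forest_demo, X_te, y_te, n_repeats=5, random_state=0)
ranking = pd.Series(perm.importances_mean, index=X_demo.columns).sort_values(ascending=False)

# Keep the top_k features (assumed cutoff) and subset both splits consistently.
top_k = 3
selected = ranking.head(top_k).index.tolist()
X_tr_sel, X_te_sel = X_tr[selected], X_te[selected]
print(selected)
```

A fixed `top_k` is only one possible rule. Keeping every feature whose mean importance exceeds its standard deviation is another option, depending on how noisy the ranking is.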

## Model

### Second classifier with the most important variables

### Random Forest

In [None]:
mi_modelo_RF = RandomForestClassifier(random_state=45)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=45)  # repeated stratified k-fold: 5 folds x 3 repeats

dicc_grid = {'ccp_alpha':[0.0, 0.0001, 0.001, 0.01, 0.1],
             'criterion': ["gini", "entropy", "log_loss"],
             'max_depth':range(1,21,2),
             'min_samples_split':range(2, 12, 2),
            }

grid = GridSearchCV(estimator=mi_modelo_RF, param_grid=dicc_grid, cv=cv, scoring="accuracy", n_jobs=-1)

grid.fit(X_train, np.ravel(y_train))

print('Best accuracy obtained with the best combination:', grid.best_score_)
print('Best hyperparameter combination found:', grid.best_params_)
print('Metric used:', grid.scoring)
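
To gauge the cost of an exhaustive search like the one above, the number of model fits can be counted before running it. The sketch below rebuilds the same grid values under the hypothetical names `demo_grid` and `demo_cv`, with the ranges expanded to lists.

```python
from sklearn.model_selection import ParameterGrid, RepeatedStratifiedKFold

# Same values as the Random Forest grid, with ranges expanded to lists.
demo_grid = {
    'ccp_alpha': [0.0, 0.0001, 0.001, 0.01, 0.1],
    'criterion': ["gini", "entropy", "log_loss"],
    'max_depth': list(range(1, 21, 2)),
    'min_samples_split': list(range(2, 12, 2)),
}
demo_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=45)

n_candidates = len(ParameterGrid(demo_grid))  # distinct hyperparameter combinations
n_splits = demo_cv.get_n_splits()             # folds multiplied by repeats
n_fits = n_candidates * n_splits              # fits before the final refit
print(n_candidates, n_splits, n_fits)
```

With these values the search covers 750 combinations over 15 splits, or 11,250 fits. If that is too slow, `RandomizedSearchCV` or a coarser grid are common alternatives.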

In [None]:
mejor_RF = grid.best_estimator_
y_hat = mejor_RF.predict(X_test)

# Metrics report
print(classification_report(y_test, y_hat, target_names=list(colores_semaforo)))

# Precision-recall plot
# Binarize with the estimator's sorted class order so the columns line up with predict_proba.
y_test_bin = label_binarize(y_test, classes=mejor_RF.classes_)
y_score = mejor_RF.predict_proba(X_test)

# For each class
precision = dict()
recall = dict()
average_precision = dict()
n_classes = y_train.unique().shape[0]
for i in range(n_classes):
    precision[i], recall[i], _ = precision_recall_curve(y_test_bin[:, i], y_score[:, i])
    average_precision[i] = average_precision_score(y_test_bin[:, i], y_score[:, i])

# A "micro-average": quantifying score on all classes jointly
precision["micro"], recall["micro"], _ = precision_recall_curve(
    y_test_bin.ravel(), y_score.ravel()
)
average_precision["micro"] = average_precision_score(y_test_bin, y_score, average="micro")
display = PrecisionRecallDisplay(
    recall=recall["micro"],
    precision=precision["micro"],
    average_precision=average_precision["micro"],
)
display.plot()
_ = display.ax_.set_title("Micro-averaged over all classes")
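
When labels are binarized for a multiclass precision-recall curve, the column order of the binarized matrix has to match the column order of the scores. `predict_proba` follows the estimator's sorted `classes_`, while `label_binarize` follows whatever order is passed as `classes`. The toy example below illustrates the effect; all of its arrays are made up for the demonstration.

```python
import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.preprocessing import label_binarize

# Toy labels and "perfect" scores laid out in sorted class order (0, 1, 2).
y_true_demo = np.array([2, 0, 1, 2, 0, 1])
y_score_demo = label_binarize(y_true_demo, classes=[0, 1, 2]).astype(float)

# Binarize once in sorted order and once in a shuffled order.
y_bin_sorted = label_binarize(y_true_demo, classes=[0, 1, 2])
y_bin_shuffled = label_binarize(y_true_demo, classes=[2, 0, 1])

ap_sorted = average_precision_score(y_bin_sorted, y_score_demo, average="micro")
ap_shuffled = average_precision_score(y_bin_shuffled, y_score_demo, average="micro")
print(ap_sorted, ap_shuffled)
```

The scores are identical in both calls, yet the shuffled binarization reports a much lower micro-averaged average precision because its columns refer to different classes.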

### Decision Tree

In [None]:
mi_modelo_DT = DecisionTreeClassifier(random_state=45)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=45)  # repeated stratified k-fold: 5 folds x 3 repeats

dicc_grid = {'ccp_alpha':[0.0, 0.0001, 0.001, 0.01, 0.1],
             'criterion': ["gini", "entropy", "log_loss"],
             'max_depth':range(1,21,2),
             'min_samples_split':range(2, 12, 2),
             'class_weight':[None,'balanced'],
            }

grid = GridSearchCV(estimator=mi_modelo_DT, param_grid=dicc_grid, cv=cv, scoring="accuracy", n_jobs=-1)

grid.fit(X_train, np.ravel(y_train))

print('Best accuracy obtained with the best combination:', grid.best_score_)
print('Best hyperparameter combination found:', grid.best_params_)
print('Metric used:', grid.scoring)

In [None]:
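mejor_DT = grid.best_estimator_
y_hat = mejor_DT.predict(X_test)

# Metrics report
print(classification_report(y_test, y_hat, target_names=list(colores_semaforo)))

# Precision-recall plot
# Binarize with the estimator's sorted class order so the columns line up with predict_proba.
y_test_bin = label_binarize(y_test, classes=mejor_DT.classes_)
y_score = mejor_DT.predict_proba(X_test)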

# For each class
precision = dict()
recall = dict()
average_precision = dict()
n_classes = y_train.unique().shape[0]
for i in range(n_classes):
    precision[i], recall[i], _ = precision_recall_curve(y_test_bin[:, i], y_score[:, i])
    average_precision[i] = average_precision_score(y_test_bin[:, i], y_score[:, i])

# A "micro-average": quantifying score on all classes jointly
precision["micro"], recall["micro"], _ = precision_recall_curve(
    y_test_bin.ravel(), y_score.ravel()
)
average_precision["micro"] = average_precision_score(y_test_bin, y_score, average="micro")
display = PrecisionRecallDisplay(
    recall=recall["micro"],
    precision=precision["micro"],
    average_precision=average_precision["micro"],
)
display.plot()
_ = display.ax_.set_title("Micro-averaged over all classes")

## Confusion matrix
cf_matrix = confusion_matrix(y_test, y_hat)
ConfusionMatrixDisplay.from_predictions(y_test, y_hat);

## Analysis of Results with the Decision Tree and Random Forest Models

Visualize the model results or predictions with a confusion matrix.

In [None]:
## Confusion matrix
y_hat = mejor_RF.predict(X_test)
cf_matrix = confusion_matrix(y_test, y_hat)
ConfusionMatrixDisplay.from_predictions(y_test, y_hat);

In [None]:
y_hat = mejor_DT.predict(X_test)
cf_matrix = confusion_matrix(y_test, y_hat)
ConfusionMatrixDisplay.from_predictions(y_test, y_hat);
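
For reading the matrices above, a row-normalized confusion matrix can help: with `normalize="true"`, each row sums to 1 and the diagonal holds the per-class recall. The sketch below uses made-up labels, assuming the same integer codes as `colores_semaforo` (0 = Amarillo, 1 = Rojo, 2 = Verde).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels, for illustration only.
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 0]

# Raw counts: rows are true classes, columns are predicted classes.
cm_counts = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1, 2])

# Row-normalized: each entry is the fraction of that true class.
cm_norm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1, 2], normalize="true")

per_class_recall = np.diag(cm_norm)
print(cm_counts)
print(per_class_recall)
```

`ConfusionMatrixDisplay.from_predictions` accepts the same `normalize` argument, along with `display_labels` for showing color names instead of integer codes.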

## Conclusions