# Predictive model - Gaviequipos returns
By:
* Felipe Loaiza Martinez
* Clara Isabela Otalvaro Agudelo

(1.5) Build an advanced predictive model in Python, where:

a. Only the 70% training portion of the data is balanced (if balancing is needed)

b. Cross-validation is run on that 70%

c. Four supervised machine learning methods are applied

d. Three ensemble methods are applied

e. At least four quality metrics are computed for each model and compared to select the best models. Every metric obtained must be interpreted.

f. Of the 7 models built, the 3 best are selected. The selection must include a test for statistically significant differences (ANOVA and Tukey).

g. The 3 selected models are tuned with grid search and an optimization method (genetic algorithms or Bayesian optimization). The best resulting model is saved for deployment.

h. The final model must be stored in a Pipeline together with the data preparation steps for deployment.

i. The model is deployed with a graphical interface

In [224]:
# Import the basic libraries
import pandas as pd  # DataFrame manipulation
import numpy as np  # arrays and vectors
import matplotlib.pyplot as plt  # plotting

# Libraries for the Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler, OneHotEncoder

## Data preparation

In [225]:
data = pd.read_csv('data_no_balanceada.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3184 entries, 0 to 3183
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   art_nombre         3184 non-null   object
 1   art_vr_reposicion  3184 non-null   int64 
 2   cco_nombre         3184 non-null   object
 3   predevolucion      3184 non-null   int64 
 4   estado             3184 non-null   object
dtypes: int64(2), object(3)
memory usage: 124.5+ KB


In [226]:
data['art_nombre'] = data['art_nombre'].astype('category')
data['cco_nombre'] = data['cco_nombre'].astype('category')
data['estado'] = data['estado'].astype('category')
data['predevolucion'] = data['predevolucion'].astype('category')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3184 entries, 0 to 3183
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   art_nombre         3184 non-null   category
 1   art_vr_reposicion  3184 non-null   int64   
 2   cco_nombre         3184 non-null   category
 3   predevolucion      3184 non-null   category
 4   estado             3184 non-null   category
dtypes: category(4), int64(1)
memory usage: 38.2 KB


## Data preparation pipeline

In [227]:
# LabelEncoder for the target variable
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
data["estado"]=labelencoder.fit_transform(data["estado"])
data.head()

Unnamed: 0,art_nombre,art_vr_reposicion,cco_nombre,predevolucion,estado
0,Formaleta,30000,BODEGA METALMEGA,0,1
1,Andamios,2000,AMATISTA LIVING,1,1
2,Andamios,150000,AMATISTA LIVING,1,0
3,Andamios,80000,AMATISTA LIVING,1,1
4,Equipo_multi,297000,BODEGA METALMEGA,0,0


In [228]:
# Split predictors and target
X = data.drop("estado", axis=1)  # predictor variables
Y = data['estado']  # target variable

In [229]:
# Define the categorical and numeric columns
categorical_cols = ['art_nombre', 'cco_nombre','predevolucion']
numeric_cols = ['art_vr_reposicion']

In [230]:
# Numeric variables: scale to [0, 1] (no imputation step; data.info() shows no missing values)
num_transformer = Pipeline(steps=[
    ('scaler', MinMaxScaler())
])

# Categorical variables: one-hot encode (no imputation step)
cat_transformer = Pipeline(steps=[
    ('dummies', OneHotEncoder(drop='if_binary',handle_unknown='ignore', sparse_output=False))
])

# Combine the two transformers
preprocessor = ColumnTransformer(transformers=[
    ('num', num_transformer, numeric_cols),
    ('cat', cat_transformer, categorical_cols)
])

pipe = Pipeline(steps=[
    ('preprocessor', preprocessor)
])

pipe

In [231]:
# # Apply the pipe to the data
# X_processed = pd.DataFrame(pipe.fit_transform(X), columns=pipe.named_steps['preprocessor'].get_feature_names_out())
# X_processed.info()

## 70-30 split


In [232]:
# 70-30 split
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, stratify=Y)
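
The effect of `stratify` can be checked on a small synthetic target. The sketch below assumes a hypothetical 10% positive rate (not the notebook's data) and fixes `random_state` so the split is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 100 positives out of 1000 rows
y_demo = np.array([1] * 100 + [0] * 900)
X_demo = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# With stratification both parts keep the original class proportion
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(len(y_tr), len(y_te), train_rate, test_rate)
```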

## Balancing the 70%

In [233]:
from imblearn.over_sampling import SMOTENC
# Positions of the categorical columns in X: art_nombre (0), cco_nombre (2), predevolucion (3)
categorical_features = [0,2,3]

# Apply SMOTENC
smote_nc = SMOTENC(categorical_features=categorical_features, random_state=42)
X_resampled, Y_resampled = smote_nc.fit_resample(X_train, Y_train)

## Cross-validation

In [234]:
X_transformed = pd.DataFrame(pipe.fit_transform(X_resampled), columns=pipe.named_steps['preprocessor'].get_feature_names_out())
X_train = X_transformed.copy()
Y_train = Y_resampled.copy()
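
Because `X_train` is already balanced at this point, every validation fold in the cross-validation cells contains synthetic rows. One possible alternative, sketched below under assumptions, resamples inside each fold so that validation folds keep the original class ratio. The sketch uses plain random oversampling through `sklearn.utils.resample` on synthetic data from `make_classification`, not SMOTENC and not the notebook's data; `oversample_minority` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Synthetic imbalanced data (assumption: class 1 is the minority)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, weights=[0.9, 0.1], random_state=0
)

def oversample_minority(X_part, y_part, seed):
    """Randomly duplicate minority rows until both classes have equal size."""
    minority = y_part == 1
    n_major = int((~minority).sum())
    X_min_up = resample(X_part[minority], replace=True,
                        n_samples=n_major, random_state=seed)
    X_bal = np.vstack([X_part[~minority], X_min_up])
    y_bal = np.concatenate([np.zeros(n_major, dtype=int),
                            np.ones(n_major, dtype=int)])
    return X_bal, y_bal

fold_f1 = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (tr, va) in enumerate(cv.split(X_demo, y_demo)):
    # Balance only the training part of the fold
    X_bal, y_bal = oversample_minority(X_demo[tr], y_demo[tr], seed=fold)
    clf = DecisionTreeClassifier(min_samples_leaf=10, random_state=0)
    clf.fit(X_bal, y_bal)
    # Score on the untouched validation part
    fold_f1.append(f1_score(y_demo[va], clf.predict(X_demo[va])))

print(np.round(fold_f1, 3))
```

Fold scores obtained this way are measured on data with the original imbalance, so they are more directly comparable with the scores on the 30% test set.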

## Classification trees

In [235]:
f1_scores = pd.DataFrame()

In [236]:
# ML method to use in cross-validation
from sklearn import tree
modelTree = tree.DecisionTreeClassifier(criterion='gini', min_samples_leaf=10, max_depth=16)

from sklearn.model_selection import cross_validate

# Cross-validation: split, fit, evaluate
scoresTree = cross_validate(modelTree, X_train, Y_train, cv=10, scoring=('f1', 'accuracy','precision', 'recall','roc_auc'), return_train_score=True, return_estimator=False,n_jobs=-1)
scoresTree = pd.DataFrame(scoresTree)  # store the results in a DataFrame

f1_scores["Tree"] = scoresTree['test_f1']

In [237]:
# Transform the X_test variables with the fitted pipeline
X_test = pd.DataFrame(pipe.transform(X_test), columns=pipe.named_steps['preprocessor'].get_feature_names_out())


In [238]:
evalData = pd.DataFrame(index=["F1", "Accuracy", "Precision", "Recall", "ROC"])

In [239]:
from sklearn import metrics

def CalcMetrics(Y_test,Y_pred):
  # F1 score
  f1=metrics.f1_score(Y_test,Y_pred)

  # Accuracy
  accuracy= metrics.accuracy_score(Y_test,Y_pred)

  # Precision
  precision=metrics.precision_score(Y_test,Y_pred)

  # Recall
  recall=metrics.recall_score(Y_test,Y_pred)

  # ROC AUC (computed here from hard class predictions, not from scores)
  roc = metrics.roc_auc_score(Y_test,Y_pred)

  return [f1, accuracy, precision, recall, roc]
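
`CalcMetrics` passes hard class predictions to `roc_auc_score`. A small sketch with hypothetical scores shows that the AUC computed from predicted probabilities can differ from the one computed from thresholded labels, which is worth keeping in mind when reading the ROC row of the results table.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
# Hypothetical predicted probabilities for class 1
scores = [0.1, 0.4, 0.45, 0.8]
# Hard predictions obtained with a 0.5 threshold
labels = [int(s >= 0.5) for s in scores]

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, labels)
print(auc_from_scores, auc_from_labels)
```

The scores rank every positive above every negative, so the first value is 1.0, while thresholding loses that ranking and gives 0.75.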

In [240]:
from sklearn import metrics
# Final model fitted on the full balanced training set
modelTree.fit(X_train, Y_train)  # 100% of the training data

# Compute metrics on the test set
Y_pred = modelTree.predict(X_test)
evalData["Tree"] = CalcMetrics(Y_test,Y_pred)

## Neural networks

In [241]:
# Neural network
# Note: learning_rate and momentum only take effect with solver='sgd'; the default solver is 'adam'
from sklearn.neural_network import MLPClassifier
modelRN =  MLPClassifier(activation="relu",hidden_layer_sizes=(5,8), learning_rate='adaptive',
                     learning_rate_init=0.02, momentum= 0.3, max_iter=1000, verbose=False)

# Cross-validation: split, fit, evaluate
scoresRN = cross_validate(modelRN, X_train, Y_train, cv=10, scoring=('f1', 'accuracy','precision', 'recall','roc_auc'), return_train_score=True, return_estimator=False,n_jobs=-1)
scoresRN = pd.DataFrame(scoresRN)  # store the results in a DataFrame

f1_scores["RN"] = scoresRN['test_f1']

In [242]:
# Final model fitted on the full balanced training set
modelRN.fit(X_train, Y_train)  # 100% of the training data
# Compute metrics on the test set
Y_pred = modelRN.predict(X_test)  # 30% test set
evalData["RN"] = CalcMetrics(Y_test,Y_pred)

## Support vector machines

In [243]:
#SVM
from sklearn.svm import SVC # SVR

modelSVM = SVC(kernel='linear') #'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'

# Cross-validation: split, fit, evaluate
scoresSVM = cross_validate(modelSVM, X_train, Y_train, cv=10, scoring=('f1', 'accuracy','precision', 'recall','roc_auc'), return_train_score=True, return_estimator=False,n_jobs=-1)
scoresSVM = pd.DataFrame(scoresSVM)  # store the results in a DataFrame

f1_scores["SVM"] = scoresSVM['test_f1']

In [244]:
# Final model fitted on the full balanced training set
modelSVM.fit(X_train, Y_train)  # 100% of the training data

# Compute metrics on the test set
Y_pred = modelSVM.predict(X_test)
evalData["SVM"] = CalcMetrics(Y_test,Y_pred)

## Naive Bayes

In [245]:
# Naive Bayes
from sklearn.naive_bayes import GaussianNB
modelNB = GaussianNB()

# Cross-validation: split, fit, evaluate
scoresNB = cross_validate(modelNB, X_train, Y_train, cv=10, scoring=('f1', 'accuracy','precision', 'recall','roc_auc'), return_train_score=True, return_estimator=False,n_jobs=-1)
scoresNB = pd.DataFrame(scoresNB)  # store the results in a DataFrame

f1_scores["NB"] = scoresNB['test_f1']

In [246]:
# Final model fitted on the full balanced training set
modelNB.fit(X_train, Y_train)  # 100% of the training data

# Compute metrics on the test set
Y_pred = modelNB.predict(X_test)
evalData["NB"] = CalcMetrics(Y_test,Y_pred)

## Bagging model

In [247]:
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

modelo_base=KNeighborsClassifier(n_neighbors=3, metric='euclidean')
modelBAG = BaggingClassifier(modelo_base, n_estimators=10, max_samples=0.6) #n_estimators=100

scoresBAG = cross_validate(modelBAG, X_train, Y_train, cv=10, scoring=('f1', 'accuracy','precision', 'recall','roc_auc'), return_train_score=True, return_estimator=False,n_jobs=-1)
scoresBAG = pd.DataFrame(scoresBAG)  # store the results in a DataFrame

f1_scores["BAG"] = scoresBAG['test_f1']

In [248]:
# Final model fitted on the full balanced training set
modelBAG.fit(X_train, Y_train)  # 100% of the training data

# Compute metrics on the test set
Y_pred = modelBAG.predict(X_test)
evalData["BAG"] = CalcMetrics(Y_test,Y_pred)

## Random Forest

In [249]:
#Random Forest
from sklearn.ensemble import RandomForestClassifier
modelRF= RandomForestClassifier(n_estimators=75,  max_samples=0.7, criterion='gini',
                              max_depth=None, min_samples_leaf=3)  # max_samples sets the fraction of rows drawn for each tree's bootstrap sample
scoresRF = cross_validate(modelRF, X_train, Y_train, cv=10, scoring=('f1', 'accuracy','precision', 'recall','roc_auc'), return_train_score=True, return_estimator=False,n_jobs=-1)
scoresRF = pd.DataFrame(scoresRF)  # store the results in a DataFrame

f1_scores["RF"] = scoresRF['test_f1']

In [250]:
# Final model fitted on the full balanced training set
modelRF.fit(X_train, Y_train)  # 100% of the training data

# Compute metrics on the test set
Y_pred = modelRF.predict(X_test)
evalData["RF"] = CalcMetrics(Y_test,Y_pred)

## Hard Voting

In [251]:
from sklearn.ensemble import VotingClassifier
model_dt = tree.DecisionTreeClassifier(criterion='gini', min_samples_leaf=20, max_depth=5)
model_knn = KNeighborsClassifier(n_neighbors=2, metric='euclidean')
model_rn = MLPClassifier(activation="relu",hidden_layer_sizes=(15,), learning_rate='constant',
                     learning_rate_init=0.02, momentum= 0.3, max_iter=500, verbose=False)
clasificadores= [('dt', model_dt), ('knn', model_knn), ('net', model_rn)]

modelVH = VotingClassifier(estimators=clasificadores, voting='hard')

scoresVH = cross_validate(modelVH, X_train, Y_train, cv=10, scoring=('f1', 'accuracy','precision', 'recall'), return_train_score=True, return_estimator=False,n_jobs=-1)
scoresVH = pd.DataFrame(scoresVH)  # store the results in a DataFrame

f1_scores["VH"] = scoresVH['test_f1']

In [252]:
# Final model fitted on the full balanced training set
modelVH.fit(X_train,Y_train)

# Compute metrics on the test set
Y_pred = modelVH.predict(X_test)
evalData["VH"] = CalcMetrics(Y_test,Y_pred)

In [253]:
evalData

Unnamed: 0,Tree,RN,SVM,NB,BAG,RF,VH
F1,0.359477,0.323671,0.264078,0.280467,0.380952,0.380282,0.332344
Accuracy,0.794979,0.707113,0.603556,0.549163,0.809623,0.8159,0.764644
Precision,0.264423,0.212025,0.16307,0.167665,0.285714,0.290323,0.23431
Recall,0.561224,0.683673,0.693878,0.857143,0.571429,0.55102,0.571429
ROC,0.691451,0.696732,0.643559,0.685564,0.704129,0.698587,0.679071


In [254]:
f1_scores

Unnamed: 0,Tree,RN,SVM,NB,BAG,RF,VH
0,0.776903,0.792711,0.782805,0.783158,0.740947,0.78866,0.776942
1,0.856471,0.790123,0.777293,0.786885,0.861386,0.88729,0.846715
2,0.874074,0.815851,0.772321,0.794926,0.894866,0.886747,0.857143
3,0.827423,0.810811,0.784648,0.776423,0.835322,0.836105,0.807107
4,0.871671,0.807601,0.757576,0.773931,0.84878,0.872289,0.82
5,0.877451,0.847222,0.786957,0.773931,0.887781,0.897561,0.887781
6,0.845209,0.795,0.754386,0.766667,0.840506,0.865526,0.80597
7,0.828571,0.781327,0.777528,0.788913,0.801034,0.838235,0.765172
8,0.868293,0.824601,0.753191,0.765182,0.870886,0.872549,0.841076
9,0.869779,0.807425,0.784922,0.761317,0.86783,0.878412,0.848921


In [255]:
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

anova_result = stats.f_oneway(f1_scores['Tree'], f1_scores['RN'], f1_scores['SVM'], f1_scores['NB'], f1_scores['BAG'], f1_scores['RF'], f1_scores['VH'])
print("ANOVA F-statistic:", anova_result.statistic)
print("ANOVA p-value:", anova_result.pvalue)

ANOVA F-statistic: 14.072517372203805
ANOVA p-value: 4.3647826835556166e-10


In [263]:
f1_scores_test = evalData.loc["F1"].sort_values(ascending=False)

In [264]:
f1_scores_test

Unnamed: 0,F1
BAG,0.380952
RF,0.380282
Tree,0.359477
VH,0.332344
RN,0.323671
NB,0.280467
SVM,0.264078


In [265]:
# Reshape the fold scores to long format: one row per (model, fold) F1 value,
# 70 rows in total, with no duplicated scores
f1_score_model_df = f1_scores.melt(var_name='model', value_name='value')


In [266]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

model = ols('value ~ C(model)', data=f1_score_model_df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)

            sum_sq    df          F        PR(>F)
C(model)  0.075151   6.0  14.072517  4.364783e-10
Residual  0.056073  63.0        NaN           NaN
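
The F statistic in the table is the ratio of the between-group mean square to the within-group mean square. The sketch below first redoes that arithmetic with the printed sums of squares, then computes F by hand on three small illustrative groups (made-up values, not the notebook's fold scores) and compares the result with `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Arithmetic check against the printed ANOVA table
f_from_table = (0.075151 / 6) / (0.056073 / 63)

# Illustrative groups (hypothetical values)
groups = [
    np.array([0.78, 0.86, 0.87]),
    np.array([0.79, 0.79, 0.82]),
    np.array([0.78, 0.78, 0.77]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n = len(groups), all_values.size

# Between-group and within-group sums of squares
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
f_scipy = stats.f_oneway(*groups).statistic
print(round(f_from_table, 4), f_manual, f_scipy)
```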


In [267]:
tukey = pairwise_tukeyhsd(endog=f1_score_model_df['value'], groups=f1_score_model_df['model'], alpha=0.05)
print(tukey)

Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower   upper  reject
----------------------------------------------------
   BAG     NB  -0.0678 0.0001 -0.1084 -0.0272   True
   BAG     RF   0.0174 0.8474 -0.0232   0.058  False
   BAG     RN  -0.0377 0.0867 -0.0783   0.003  False
   BAG    SVM  -0.0718    0.0 -0.1124 -0.0311   True
   BAG   Tree   0.0047 0.9998  -0.036  0.0453  False
   BAG     VH  -0.0193 0.7764 -0.0599  0.0214  False
    NB     RF   0.0852    0.0  0.0446  0.1258   True
    NB     RN   0.0301 0.2804 -0.0105  0.0708  False
    NB    SVM   -0.004 0.9999 -0.0446  0.0367  False
    NB   Tree   0.0725    0.0  0.0318  0.1131   True
    NB     VH   0.0485 0.0095  0.0079  0.0892   True
    RF     RN  -0.0551  0.002 -0.0957 -0.0144   True
    RF    SVM  -0.0892    0.0 -0.1298 -0.0485   True
    RF   Tree  -0.0128 0.9614 -0.0534  0.0279  False
    RF     VH  -0.0367 0.1034 -0.0773   0.004  False
    RN    SVM  -0.0341 0.1575 -0.0747  0.0065 

In [268]:
tukey_results = pd.DataFrame(data=tukey.summary().data[1:], columns=tukey.summary().data[0])
significant_pairs = tukey_results[tukey_results['reject'] == True]

In [269]:
significant_pairs

Unnamed: 0,group1,group2,meandiff,p-adj,lower,upper,reject
0,BAG,NB,-0.0678,0.0001,-0.1084,-0.0272,True
3,BAG,SVM,-0.0718,0.0,-0.1124,-0.0311,True
6,NB,RF,0.0852,0.0,0.0446,0.1258,True
9,NB,Tree,0.0725,0.0,0.0318,0.1131,True
10,NB,VH,0.0485,0.0095,0.0079,0.0892,True
11,RF,RN,-0.0551,0.002,-0.0957,-0.0144,True
12,RF,SVM,-0.0892,0.0,-0.1298,-0.0485,True
16,RN,Tree,0.0423,0.0359,0.0017,0.083,True
18,SVM,Tree,0.0764,0.0,0.0358,0.1171,True
19,SVM,VH,0.0525,0.0038,0.0119,0.0932,True
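
One way to read the Tukey table is to pick a reference model and list the models whose difference from it is not rejected. The sketch below does this for RF using only the `reject` values printed above for pairs that involve RF; `not_different_from` is a hypothetical helper.

```python
import pandas as pd

# (group1, group2, reject) for the pairs involving RF, copied from the Tukey output
pairs = pd.DataFrame(
    [
        ("BAG", "RF", False),
        ("NB", "RF", True),
        ("RF", "RN", True),
        ("RF", "SVM", True),
        ("RF", "Tree", False),
        ("RF", "VH", False),
    ],
    columns=["group1", "group2", "reject"],
)

def not_different_from(reference, table):
    """Return models with no significant difference from the reference model."""
    involved = table[(table.group1 == reference) | (table.group2 == reference)]
    kept = involved[~involved.reject]
    # Take whichever side of the pair is not the reference
    others = kept.group1.where(kept.group1 != reference, kept.group2)
    return sorted(others)

similar_to_rf = not_different_from("RF", pairs)
print(similar_to_rf)
```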


## The selected models were the following
BAG, RF, and Tree, because they have the highest F1 scores on the test data and because, according to the Tukey test, the differences among them are not statistically significant.

# Hyperparameter tuning

## Bagging

Grid Search

In [270]:
# Grid Search
from sklearn.model_selection import GridSearchCV

# Hyperparameters to search
param_grid_BAG = {
    'n_estimators': [5, 10, 15],
    'max_samples':[0.6, 0.7, 0.8]
}

grid_search_BAG = GridSearchCV(modelBAG, param_grid_BAG, cv=10, scoring='f1', n_jobs=-1)  # maximizes the scoring metric
grid_search_BAG.fit(X_train, Y_train)  # 70%

grid_best_model_BAG = grid_search_BAG.best_estimator_
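
To get a feel for the cost of this search, the number of candidates can be counted with `ParameterGrid`. This is a small sketch using the same grid values as above; the fold count of 10 matches `cv=10`.

```python
from sklearn.model_selection import ParameterGrid

grid = {
    "n_estimators": [5, 10, 15],
    "max_samples": [0.6, 0.7, 0.8],
}

# Every combination of the listed values is one candidate
n_candidates = len(ParameterGrid(grid))
# Each candidate is fitted once per cross-validation fold
n_fits = n_candidates * 10
print(n_candidates, n_fits)
```

With the default `refit=True`, the best candidate is then fitted one more time on all the data passed to `fit`.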

Bayes Search

In [271]:
!pip install scikit-optimize


In [272]:
from skopt.space import Real, Integer
from skopt import BayesSearchCV

param_grid_BAG_bayes = {
    'n_estimators': Integer(5, 15),
    'max_samples':Real(0.6,0.8)
}

bayes_search_BAG = BayesSearchCV(
    estimator=modelBAG,
    search_spaces=param_grid_BAG_bayes,
    n_iter=20,       # number of search iterations
    cv=10,            # number of cross-validation folds
    n_jobs=-1,        # use all available cores
    scoring='f1',     # the scoring metric is maximized
    refit=True        # the best configuration found is refit on all the data passed to fit
)

# Run the hyperparameter search
bayes_search_BAG.fit(X_train, Y_train)  # 70%

bayes_best_model_BAG = bayes_search_BAG.best_estimator_

## Tree

Grid Search

In [275]:
# Hyperparameters to search
param_grid_Tree = {
    'max_depth': [10, 15, 20],
    'min_samples_leaf': [10, 15, 20]
}

grid_search_Tree = GridSearchCV(modelTree, param_grid_Tree, cv=10, scoring='f1', n_jobs=-1)  # maximizes the scoring metric
grid_search_Tree.fit(X_train, Y_train)  # 70%

grid_best_model_Tree = grid_search_Tree.best_estimator_

Bayes Search

In [276]:
param_grid_Tree_bayes = {
    'max_depth': Integer(10, 20),
    'min_samples_leaf': Integer(10, 20)
}
bayes_search_Tree = BayesSearchCV(
    estimator=modelTree,
    search_spaces=param_grid_Tree_bayes,
    n_iter=20,       # number of search iterations
    cv=10,            # number of cross-validation folds
    n_jobs=-1,        # use all available cores
    scoring='f1',     # the scoring metric is maximized
    refit=True        # the best configuration found is refit on all the data passed to fit
)

# Run the hyperparameter search
bayes_search_Tree.fit(X_train, Y_train)  # 70%

bayes_best_model_Tree = bayes_search_Tree.best_estimator_

## Random Forest

Grid Search

In [277]:
# Hyperparameters to search
param_grid_RF = {
    'n_estimators': [50, 75, 100],
    'criterion': ['gini', 'entropy']
}
grid_search_RF = GridSearchCV(modelRF, param_grid_RF, cv=10, scoring='f1', n_jobs=-1)  # maximizes the scoring metric
grid_search_RF.fit(X_train, Y_train)  # 70%

grid_best_model_RF = grid_search_RF.best_estimator_

Bayes Search

In [278]:
param_grid_RF_bayes = {
    'n_estimators': Integer(50, 100),
    'criterion': ['gini', 'entropy']
}

bayes_search_RF = BayesSearchCV(
    estimator=modelRF,
    search_spaces=param_grid_RF_bayes,
    n_iter=20,       # number of search iterations
    cv=10,            # number of cross-validation folds
    n_jobs=-1,        # use all available cores
    scoring='f1',     # the scoring metric is maximized
    refit=True        # the best configuration found is refit on all the data passed to fit
)

# Run the hyperparameter search
bayes_search_RF.fit(X_train, Y_train)  # 70%

bayes_best_model_RF = bayes_search_RF.best_estimator_

We evaluate the models tuned with GridSearch and BayesSearch

In [287]:
f1_final_scores = pd.DataFrame(index=["F1"])

In [288]:
# Compute metrics on the test set
Y_pred = grid_best_model_BAG.predict(X_test)
f1_final_scores["BAG_grid"] = metrics.f1_score(Y_test,Y_pred)

Y_pred = bayes_best_model_BAG.predict(X_test)
f1_final_scores["BAG_bayes"] = metrics.f1_score(Y_test,Y_pred)

Y_pred = grid_best_model_Tree.predict(X_test)
f1_final_scores["Tree_grid"] = metrics.f1_score(Y_test,Y_pred)

Y_pred = bayes_best_model_Tree.predict(X_test)
f1_final_scores["Tree_bayes"] = metrics.f1_score(Y_test,Y_pred)

Y_pred = grid_best_model_RF.predict(X_test)
f1_final_scores["RF_grid"] = metrics.f1_score(Y_test,Y_pred)

Y_pred = bayes_best_model_RF.predict(X_test)
f1_final_scores["RF_bayes"] = metrics.f1_score(Y_test,Y_pred)

In [292]:
f1_final_scores

Unnamed: 0,BAG_grid,BAG_bayes,Tree_grid,Tree_bayes,RF_grid,RF_bayes
F1,0.401606,0.410042,0.359477,0.359477,0.38488,0.385965


In [293]:
bayes_search_BAG.best_params_

OrderedDict([('max_samples', 0.799550464470973), ('n_estimators', 15)])

After evaluating each model on the test data, the model with the best F1 score is the Bagging model found by BayesSearch (F1 = 0.410042).
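
The same comparison can be made programmatically. The sketch below rebuilds the F1 row from the printed values and uses `idxmax` to pick the winner; `f1_table` is a hypothetical name introduced for illustration.

```python
import pandas as pd

# Test-set F1 values copied from the table above
f1_table = pd.Series({
    "BAG_grid": 0.401606,
    "BAG_bayes": 0.410042,
    "Tree_grid": 0.359477,
    "Tree_bayes": 0.359477,
    "RF_grid": 0.38488,
    "RF_bayes": 0.385965,
})

best_name = f1_table.idxmax()                     # label of the highest F1
ranking = f1_table.sort_values(ascending=False)   # full ordering for reference
print(best_name)
print(ranking)
```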

## Exporting the LabelEncoder, Pipeline, and model

In [294]:
X.columns.values

array(['art_nombre', 'art_vr_reposicion', 'cco_nombre', 'predevolucion'],
      dtype=object)

In [296]:
import pickle
filename = 'modeloFinal_GaviEquipos.pkl'
variables = X.columns.values
# Save the model, preprocessing pipeline, label encoder, and column names together
with open(filename, 'wb') as f:
    pickle.dump([bayes_best_model_BAG, pipe, labelencoder, variables], f)
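
On the deployment side, the pickled list is unpacked in the same order in which it was saved. The sketch below shows that round trip with hypothetical stand-ins: a toy frame, made-up labels `label_a` and `label_b`, a small decision tree, and a temporary file instead of the notebook's real objects and filename.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-ins for the notebook's objects
toy = pd.DataFrame({
    "art_nombre": ["Andamios", "Formaleta", "Andamios", "Formaleta"],
    "art_vr_reposicion": [2000, 30000, 150000, 80000],
})
encoder = LabelEncoder().fit(["label_a", "label_b"])
target = encoder.transform(["label_a", "label_b", "label_a", "label_b"])
prep = ColumnTransformer(
    [
        ("num", MinMaxScaler(), ["art_vr_reposicion"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["art_nombre"]),
    ],
    sparse_threshold=0,  # dense output
)
clf = DecisionTreeClassifier(random_state=0).fit(prep.fit_transform(toy), target)

# Save in the same order as the notebook: model, pipeline, encoder, column names
path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")
with open(path, "wb") as f:
    pickle.dump([clf, prep, encoder, toy.columns.values], f)

# Deployment side: load, preprocess a new row, predict, decode the label
with open(path, "rb") as f:
    model, pipe_loaded, label_enc, variables = pickle.load(f)

new_row = pd.DataFrame([["Andamios", 2000]], columns=variables)
pred = model.predict(pipe_loaded.transform(new_row))
decoded = label_enc.inverse_transform(pred)[0]
print(decoded)
```

Keeping the column names in the artifact lets the interface build the input frame with the columns the pipeline expects.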

