(1.5) Build an advanced predictive model in Python, where:

a. Only the 70% training split is balanced (if balancing is needed)

b. Cross-validation is run on that 70%

c. Four supervised machine learning methods are applied

d. Three ensemble methods are applied

e. At least four quality metrics are computed for each model and compared to select the best models. Every metric obtained must be interpreted.

f. Of the 7 models built, the best 3 are selected. The selection must apply a test of statistically significant differences (ANOVA and Tukey).

g. The 3 selected models go through hyperparameter tuning with grid search and optimization (genetic algorithms/Bayesian optimization). The best resulting model is saved for deployment.

h. The final model must be stored in a Pipeline together with the data-preparation steps for deployment.

i. The model is deployed with a graphical interface

In [249]:
# Import the basic libraries
import pandas as pd # DataFrame manipulation
import numpy as np  # arrays and vectors
import matplotlib.pyplot as plt # plotting

# Libraries for the Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler, OneHotEncoder

## Data preparation

In [250]:
data = pd.read_csv('data_no_balanceada.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3184 entries, 0 to 3183
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   art_nombre         3184 non-null   object
 1   art_vr_reposicion  3184 non-null   int64 
 2   cco_nombre         3184 non-null   object
 3   estado             3184 non-null   object
dtypes: int64(1), object(3)
memory usage: 99.6+ KB


In [251]:
data['art_nombre'] = data['art_nombre'].astype('category')
data['cco_nombre'] = data['cco_nombre'].astype('category')
data['estado'] = data['estado'].astype('category')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3184 entries, 0 to 3183
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   art_nombre         3184 non-null   category
 1   art_vr_reposicion  3184 non-null   int64   
 2   cco_nombre         3184 non-null   category
 3   estado             3184 non-null   category
dtypes: category(3), int64(1)
memory usage: 35.0 KB


## Data preparation pipeline

In [252]:
# LabelEncoder for the target variable
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
data["estado"]=labelencoder.fit_transform(data["estado"])
data.head()

Unnamed: 0,art_nombre,art_vr_reposicion,cco_nombre,estado
0,Formaleta,30000,BODEGA METALMEGA,1
1,Andamios,2000,AMATISTA LIVING,1
2,Andamios,150000,AMATISTA LIVING,0
3,Andamios,80000,AMATISTA LIVING,1
4,Equipo_multi,297000,BODEGA METALMEGA,0


In [253]:
# Separate predictors and target
X = data.drop("estado", axis = 1) # Predictor variables
Y = data['estado'] # Target variable

In [254]:
# Define the categorical and numeric columns
categorical_cols = ['art_nombre', 'cco_nombre']
numeric_cols = ['art_vr_reposicion']

In [255]:
# Numeric variables: scale to [0, 1] (the data has no missing values, so there is no imputation step)
num_transformer = Pipeline(steps=[
    ('scaler', MinMaxScaler())
])

# Categorical variables: create dummy variables (one-hot encoding)
cat_transformer = Pipeline(steps=[
    ('dummies', OneHotEncoder(drop='if_binary',handle_unknown='ignore', sparse_output=False))
])

# Combine the two previous steps
preprocessor = ColumnTransformer(transformers=[
    ('num', num_transformer, numeric_cols),
    ('cat', cat_transformer, categorical_cols)
])

pipe = Pipeline(steps=[
    ('preprocessor', preprocessor)
])

pipe

In [256]:
# # Apply the pipe to the data
# X_processed = pd.DataFrame(pipe.fit_transform(X), columns=pipe.named_steps['preprocessor'].get_feature_names_out())
# X_processed.info()

## 70-30 split


In [257]:
# 70-30 split, stratified on the target
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, stratify=Y)
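The effect of `stratify=Y` can be checked on a small made-up target: both splits keep the original class ratio. This is a minimal sketch; the arrays `X_toy` and `y_toy` and the fixed `random_state` are assumptions for the example, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced target: 80 negatives, 20 positives
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

# Share of the positive class in each split
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(train_rate, test_rate)
```

Passing a `random_state` as in this sketch also makes the split reproducible across runs.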

## Balancing the 70%

In [258]:
from imblearn.over_sampling import SMOTENC
# Positional indices of the categorical columns in X: art_nombre (0) and cco_nombre (2)
categorical_features = [0,2]

# Apply SMOTENC to the training split only
smote_nc = SMOTENC(categorical_features=categorical_features, random_state=42)
X_resampled, Y_resampled = smote_nc.fit_resample(X_train, Y_train)

## Cross-validation

In [259]:
X_transformed = pd.DataFrame(pipe.fit_transform(X_resampled), columns=pipe.named_steps['preprocessor'].get_feature_names_out())
X_train = X_transformed.copy()
Y_train = Y_resampled.copy()
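From this point on, cross-validation runs on data that was balanced before being split into folds, so synthetic rows built from a validation fold's neighbors can appear in the training folds. A common alternative is to resample inside each fold. The sketch below shows that idea with plain random oversampling (a simpler stand-in for SMOTENC) on synthetic data; `oversample_minority`, `X_toy`, and `y_toy` are hypothetical names, and the helper assumes a binary 0/1 target.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample


def oversample_minority(X, y, random_state=0):
    """Randomly duplicate minority rows until both classes have equal size (binary 0/1 target)."""
    minority = np.argmin(np.bincount(y))
    X_min, X_maj = X[y == minority], X[y != minority]
    X_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=random_state)
    X_bal = np.vstack([X_maj, X_up])
    y_bal = np.concatenate([np.full(len(X_maj), 1 - minority), np.full(len(X_up), minority)])
    return X_bal, y_bal


# Synthetic imbalanced data standing in for the preprocessed 70% split
X_toy, y_toy = make_classification(n_samples=600, weights=[0.85, 0.15], random_state=0)

fold_f1 = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in cv.split(X_toy, y_toy):
    # Balance only the training folds; the validation fold keeps its original distribution
    X_bal, y_bal = oversample_minority(X_toy[train_idx], y_toy[train_idx])
    clf = DecisionTreeClassifier(min_samples_leaf=10, random_state=0).fit(X_bal, y_bal)
    fold_f1.append(f1_score(y_toy[val_idx], clf.predict(X_toy[val_idx])))

print(np.round(fold_f1, 3))
```

Scores obtained this way are measured on folds with the original class ratio, which should make them more comparable to the scores on the untouched 30% test split. The imbalanced-learn package also provides its own pipeline class that applies samplers only while fitting, which may be a more direct fit for SMOTENC.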

## Classification trees

In [260]:
f1_scores = pd.DataFrame()

In [261]:
# ML method to use in cross-validation
from sklearn import tree
modelTree = tree.DecisionTreeClassifier(criterion='gini', min_samples_leaf=10, max_depth=16)

from sklearn.model_selection import cross_validate

# Cross-validation: split, fit, evaluate
scoresTree = cross_validate(modelTree, X_train, Y_train, cv=10, scoring=('f1', 'accuracy','precision', 'recall','roc_auc'), return_train_score=True, return_estimator=False,n_jobs=-1)
scoresTree =pd.DataFrame(scoresTree) # Store the results in a DataFrame

f1_scores["Tree"] = scoresTree['test_f1']

In [262]:
# Transform the X_test variables with the fitted pipeline
X_test = pd.DataFrame(pipe.transform(X_test), columns=pipe.named_steps['preprocessor'].get_feature_names_out())


In [263]:
evalData = pd.DataFrame(index=["F1", "Accuracy", "Precision", "Recall", "ROC"])

In [264]:
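from sklearn import metrics  # imported here so the function does not depend on a later cell

def CalcMetrics(Y_test,Y_pred):
  """Return [F1, accuracy, precision, recall, ROC AUC] for binary predictions."""
  # F1 score
  f1=metrics.f1_score(Y_test,Y_pred)

  # Accuracy
  accuracy= metrics.accuracy_score(Y_test,Y_pred)

  # Precision
  precision=metrics.precision_score(Y_test,Y_pred)

  # Recall
  recall=metrics.recall_score(Y_test,Y_pred)

  # ROC AUC, computed here from hard class labels rather than scores or probabilities
  roc = metrics.roc_auc_score(Y_test,Y_pred)

  return [f1, accuracy, precision, recall, roc]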
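To help interpret these metrics, the sketch below recomputes them by hand from a confusion matrix on ten made-up labels. It also shows that `roc_auc_score` called with hard 0/1 predictions, as `CalcMetrics` does, equals the balanced accuracy (the mean of recall and specificity). The vectors `y_true` and `y_pred` are invented for the example.

```python
from sklearn import metrics

# Made-up labels: 6 negatives followed by 4 positives
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)      # of the predicted positives, how many are correct
recall = tp / (tp + fn)         # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
specificity = tn / (tn + fp)    # of the actual negatives, how many were found
balanced_accuracy = (recall + specificity) / 2

# With hard labels the ROC curve has a single operating point
auc_from_labels = metrics.roc_auc_score(y_true, y_pred)

print(precision, recall, f1, balanced_accuracy, auc_from_labels)
```

Read this way, the `ROC` row in the results table is a balanced accuracy. A threshold-free AUC would need scores from `predict_proba` or `decision_function`, which a hard-voting ensemble does not provide.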

In [265]:
from sklearn import metrics
# Final model fit on the full balanced training set
modelTree.fit(X_train, Y_train) #100%

# Compute metrics on the 30% test split
Y_pred = modelTree.predict(X_test)
evalData["Tree"] = CalcMetrics(Y_test,Y_pred)

## Neural networks

In [266]:
# Neural network
# Note: learning_rate and momentum only take effect with solver='sgd'; the default solver is 'adam'
from sklearn.neural_network import MLPClassifier
modelRN = MLPClassifier(activation="relu",hidden_layer_sizes=(5,8), learning_rate='adaptive',
                     learning_rate_init=0.02, momentum= 0.3, max_iter=1000, verbose=False)

# Cross-validation: split, fit, evaluate
scoresRN = cross_validate(modelRN, X_train, Y_train, cv=10, scoring=('f1', 'accuracy','precision', 'recall','roc_auc'), return_train_score=True, return_estimator=False,n_jobs=-1)
scoresRN = pd.DataFrame(scoresRN) # Store the results in a DataFrame

f1_scores["RN"] = scoresRN['test_f1']

In [267]:
# Final model fit on the full balanced training set
modelRN.fit(X_train, Y_train) #100%
# Compute metrics on the 30% test split
Y_pred = modelRN.predict(X_test) #30% Test
evalData["RN"] = CalcMetrics(Y_test,Y_pred)

## Support vector machines

In [268]:
#SVM
from sklearn.svm import SVC # SVR is the regression counterpart

modelSVM = SVC(kernel='linear') #'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'

# Cross-validation: split, fit, evaluate
scoresSVM = cross_validate(modelSVM, X_train, Y_train, cv=10, scoring=('f1', 'accuracy','precision', 'recall','roc_auc'), return_train_score=True, return_estimator=False,n_jobs=-1)
scoresSVM = pd.DataFrame(scoresSVM) # Store the results in a DataFrame

f1_scores["SVM"] = scoresSVM['test_f1']

In [269]:
# Final model fit on the full balanced training set
modelSVM.fit(X_train, Y_train) #100%

# Compute metrics on the 30% test split
Y_pred = modelSVM.predict(X_test)
evalData["SVM"] = CalcMetrics(Y_test,Y_pred)

## Naive Bayes

In [270]:
# Naive Bayes
from sklearn.naive_bayes import GaussianNB
modelNB = GaussianNB()

# Cross-validation: split, fit, evaluate
scoresNB = cross_validate(modelNB, X_train, Y_train, cv=10, scoring=('f1', 'accuracy','precision', 'recall','roc_auc'), return_train_score=True, return_estimator=False,n_jobs=-1)
scoresNB = pd.DataFrame(scoresNB) # Store the results in a DataFrame

f1_scores["NB"] = scoresNB['test_f1']

In [271]:
# Final model fit on the full balanced training set
modelNB.fit(X_train, Y_train) #100%

# Compute metrics on the 30% test split
Y_pred = modelNB.predict(X_test)
evalData["NB"] = CalcMetrics(Y_test,Y_pred)

## Bagging model

In [272]:
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Base learner: 3-nearest neighbors; each of the 10 estimators is trained on 60% of the rows
modelo_base=KNeighborsClassifier(n_neighbors=3, metric='euclidean')
modelBAG = BaggingClassifier(modelo_base, n_estimators=10, max_samples=0.6) #n_estimators=100

scoresBAG = cross_validate(modelBAG, X_train, Y_train, cv=10, scoring=('f1', 'accuracy','precision', 'recall','roc_auc'), return_train_score=True, return_estimator=False,n_jobs=-1)
scoresBAG = pd.DataFrame(scoresBAG) # Store the results in a DataFrame

f1_scores["BAG"] = scoresBAG['test_f1']

In [273]:
# Final model fit on the full balanced training set
modelBAG.fit(X_train, Y_train) #100%

# Compute metrics on the 30% test split
Y_pred = modelBAG.predict(X_test)
evalData["BAG"] = CalcMetrics(Y_test,Y_pred)

## Random Forest

In [274]:
#Random Forest
from sklearn.ensemble import RandomForestClassifier
# max_samples sets the fraction of rows drawn (with bootstrap) to train each tree;
# feature subsampling is controlled by max_features, not max_samples
modelRF= RandomForestClassifier(n_estimators=75,  max_samples=0.7, criterion='gini',
                              max_depth=None, min_samples_leaf=3)
scoresRF = cross_validate(modelRF, X_train, Y_train, cv=10, scoring=('f1', 'accuracy','precision', 'recall','roc_auc'), return_train_score=True, return_estimator=False,n_jobs=-1)
scoresRF = pd.DataFrame(scoresRF) # Store the results in a DataFrame

f1_scores["RF"] = scoresRF['test_f1']

In [275]:
# Final model fit on the full balanced training set
modelRF.fit(X_train, Y_train)

# Compute metrics on the 30% test split
Y_pred = modelRF.predict(X_test)
evalData["RF"] = CalcMetrics(Y_test,Y_pred)

## Hard Voting

In [276]:
from sklearn.ensemble import VotingClassifier
model_dt = tree.DecisionTreeClassifier(criterion='gini', min_samples_leaf=20, max_depth=5)
model_knn = KNeighborsClassifier(n_neighbors=1, metric='euclidean')
# hidden_layer_sizes as a one-element tuple: a single hidden layer with 15 units
model_rn = MLPClassifier(activation="relu",hidden_layer_sizes=(15,), learning_rate='constant',
                     learning_rate_init=0.02, momentum= 0.3, max_iter=500, verbose=False)
clasificadores= [('dt', model_dt), ('knn', model_knn), ('net', model_rn)]

modelVH = VotingClassifier(estimators=clasificadores, voting='hard')

# roc_auc is left out of the scoring because a hard-voting ensemble does not expose predict_proba
scoresVH = cross_validate(modelVH, X_train, Y_train, cv=10, scoring=('f1', 'accuracy','precision', 'recall'), return_train_score=True, return_estimator=False,n_jobs=-1)
scoresVH = pd.DataFrame(scoresVH) # Store the results in a DataFrame

f1_scores["VH"] = scoresVH['test_f1']

In [277]:
# Final model fit on the full balanced training set
modelVH.fit(X_train,Y_train)

# Compute metrics on the 30% test split
Y_pred = modelVH.predict(X_test)
evalData["VH"] = CalcMetrics(Y_test,Y_pred)
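The idea behind `voting='hard'` can be sketched with plain NumPy: each base classifier casts one vote per sample and the majority class wins. The prediction matrix below is made up, with one row per classifier (standing in for the tree, the k-NN, and the network) and one column per sample.

```python
import numpy as np

# Rows: hypothetical predictions from three classifiers; columns: four samples
predictions = np.array([
    [0, 1, 1, 0],   # e.g. decision tree
    [1, 1, 0, 0],   # e.g. k-nearest neighbors
    [0, 1, 0, 1],   # e.g. neural network
])

# Binary majority vote: class 1 wins when more than half of the classifiers predict it
n_classifiers = predictions.shape[0]
majority = (predictions.sum(axis=0) > n_classifiers / 2).astype(int)

print(majority)  # [0 1 0 0]
```

Using an odd number of base classifiers, as in this section, avoids ties in the binary case.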

In [278]:
evalData

Unnamed: 0,Tree,RN,SVM,NB,BAG,RF,VH
F1,0.36,0.352941,0.372603,0.2694,0.381625,0.378571,0.418006
Accuracy,0.799163,0.758368,0.76046,0.478033,0.816946,0.817992,0.810669
Precision,0.267327,0.243243,0.254682,0.157265,0.291892,0.291209,0.305164
Recall,0.55102,0.642857,0.693878,0.938776,0.55102,0.540816,0.663265
ROC,0.689263,0.707209,0.730971,0.682092,0.69917,0.695233,0.745386


In [279]:
f1_scores

Unnamed: 0,Tree,RN,SVM,NB,BAG,RF,VH
0,0.716981,0.755556,0.715789,0.705441,0.736264,0.738292,0.713514
1,0.809278,0.753769,0.733333,0.760163,0.838875,0.840506,0.781553
2,0.852041,0.83293,0.751323,0.768916,0.875949,0.879795,0.862338
3,0.839024,0.758621,0.692913,0.729412,0.852941,0.866828,0.842857
4,0.807198,0.77037,0.729223,0.743434,0.816537,0.823529,0.831683
5,0.807786,0.748235,0.727273,0.762475,0.829146,0.822384,0.797066
6,0.815385,0.781395,0.717678,0.726547,0.84131,0.84264,0.803109
7,0.866337,0.801865,0.763636,0.748988,0.859259,0.857855,0.842365
8,0.819588,0.752475,0.712766,0.741036,0.832487,0.840506,0.79798
9,0.859259,0.781038,0.736292,0.763052,0.860636,0.880196,0.830846


In [280]:
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

anova_result = stats.f_oneway(f1_scores['Tree'], f1_scores['RN'], f1_scores['SVM'], f1_scores['NB'], f1_scores['BAG'], f1_scores['RF'], f1_scores['VH'])
print("ANOVA F-statistic:", anova_result.statistic)
print("ANOVA p-value:", anova_result.pvalue)

ANOVA F-statistic: 16.510849495340132
ANOVA p-value: 2.5081366223805165e-11
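The F-statistic reported by `f_oneway` is the ratio of between-group to within-group variance. The sketch below recomputes it by hand for three small made-up groups of scores and compares the result with SciPy; the numbers in `groups` are invented for illustration. With 7 models and 10 folds each, the same formula uses 6 and 63 degrees of freedom.

```python
import numpy as np
from scipy import stats

# Made-up F1 scores for three hypothetical models, four folds each
groups = [
    np.array([0.80, 0.82, 0.85, 0.81]),
    np.array([0.70, 0.73, 0.72, 0.74]),
    np.array([0.79, 0.83, 0.84, 0.80]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n_total = len(groups), len(all_values)

# Variation of the group means around the grand mean (k - 1 degrees of freedom)
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation of the observations around their own group mean (n_total - k degrees of freedom)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
f_scipy = stats.f_oneway(*groups).statistic

print(f_manual, f_scipy)
```

A large F with a small p-value only says that at least one model differs from the rest; the Tukey test is what identifies which pairs differ.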


In [281]:
f1_scores_test = evalData.loc["F1"].sort_values(ascending=False)

In [282]:
f1_scores_test

Unnamed: 0,F1
VH,0.418006
BAG,0.381625
RF,0.378571
SVM,0.372603
Tree,0.36
RN,0.352941
NB,0.2694


In [283]:
# Reshape f1_scores from wide (one column per model) to long format:
# one row per (model, fold), with the fold's F1 score in 'value'.
# This replaces the nested loop, which appended an extra duplicated row per model.
f1_score_model_df = f1_scores.melt(var_name='model', value_name='value')


In [284]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

model = ols('value ~ C(model)', data=f1_score_model_df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)

            sum_sq    df          F        PR(>F)
C(model)  0.117462   6.0  16.510849  2.508137e-11
Residual  0.074699  63.0        NaN           NaN


In [285]:
tukey = pairwise_tukeyhsd(endog=f1_score_model_df['value'], groups=f1_score_model_df['model'], alpha=0.05)
print(tukey)

Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower   upper  reject
----------------------------------------------------
   BAG     NB  -0.0894    0.0 -0.1363 -0.0425   True
   BAG     RF   0.0049 0.9999  -0.042  0.0518  False
   BAG     RN  -0.0607 0.0037 -0.1076 -0.0138   True
   BAG    SVM  -0.1063    0.0 -0.1532 -0.0594   True
   BAG   Tree  -0.0151 0.9571  -0.062  0.0318  False
   BAG     VH   -0.024 0.7082 -0.0709  0.0229  False
    NB     RF   0.0943    0.0  0.0474  0.1412   True
    NB     RN   0.0287 0.5123 -0.0182  0.0756  False
    NB    SVM  -0.0169 0.9261 -0.0638    0.03  False
    NB   Tree   0.0743 0.0002  0.0274  0.1212   True
    NB     VH   0.0654 0.0014  0.0185  0.1123   True
    RF     RN  -0.0656 0.0013 -0.1125 -0.0187   True
    RF    SVM  -0.1112    0.0 -0.1581 -0.0643   True
    RF   Tree    -0.02  0.851 -0.0669  0.0269  False
    RF     VH  -0.0289 0.5021 -0.0758   0.018  False
    RN    SVM  -0.0456 0.0619 -0.0925  0.0013 

In [292]:
tukey_results = pd.DataFrame(data=tukey.summary().data[1:], columns=tukey.summary().data[0])
significant_pairs = tukey_results[tukey_results['reject'] == True]

In [293]:
significant_pairs

Unnamed: 0,group1,group2,meandiff,p-adj,lower,upper,reject
0,BAG,NB,-0.0894,0.0,-0.1363,-0.0425,True
2,BAG,RN,-0.0607,0.0037,-0.1076,-0.0138,True
3,BAG,SVM,-0.1063,0.0,-0.1532,-0.0594,True
6,NB,RF,0.0943,0.0,0.0474,0.1412,True
9,NB,Tree,0.0743,0.0002,0.0274,0.1212,True
10,NB,VH,0.0654,0.0014,0.0185,0.1123,True
11,RF,RN,-0.0656,0.0013,-0.1125,-0.0187,True
12,RF,SVM,-0.1112,0.0,-0.1581,-0.0643,True
18,SVM,Tree,0.0913,0.0,0.0444,0.1382,True
19,SVM,VH,0.0823,0.0,0.0354,0.1292,True


## The selected models were the following.
VH, BAG, and RF, because they have the best F1 scores on the test data and because, according to the Tukey test, there are no significant differences among them.

# Hyperparameter tuning

## Bagging

Grid Search

In [298]:
# Grid Search
from sklearn.model_selection import GridSearchCV

# Hyperparameters to search
param_grid_BAG = {
    'n_estimators': [5, 10, 15],
    'max_samples':[0.6, 0.7, 0.8]
}

grid_search_BAG = GridSearchCV(modelBAG, param_grid_BAG, cv=10, scoring='f1', n_jobs=-1) # the search maximizes the scoring
grid_search_BAG.fit(X_train, Y_train) #70%

grid_best_model_BAG = grid_search_BAG.best_estimator_
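Besides `best_estimator_`, a fitted `GridSearchCV` also exposes which combination won and how each candidate scored. The sketch below runs a reduced grid on synthetic data to show those attributes; `X_toy`, `y_toy`, `toy_bag`, and the smaller grid are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data standing in for the preprocessed training split
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

toy_bag = BaggingClassifier(KNeighborsClassifier(n_neighbors=3), random_state=0)
toy_grid = {"n_estimators": [5, 10], "max_samples": [0.6, 0.8]}

toy_search = GridSearchCV(toy_bag, toy_grid, cv=3, scoring="f1")
toy_search.fit(X_toy, y_toy)

best_params = toy_search.best_params_                   # winning combination
best_cv_f1 = toy_search.best_score_                     # its mean cross-validated F1
n_candidates = len(toy_search.cv_results_["params"])    # 2 x 2 = 4 combinations tried

print(best_params, round(best_cv_f1, 3), n_candidates)
```

Inspecting `best_params_` makes it easy to see whether the optimum sits on the edge of the grid, which would suggest widening the search range.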

Bayes Search

In [300]:
!pip install scikit-optimize



In [309]:
from skopt.space import Real, Integer
from skopt import BayesSearchCV

param_grid_BAG_bayes = {
    'n_estimators': Integer(5, 15),
    'max_samples':Real(0.6,0.8)
}

bayes_search_BAG = BayesSearchCV(
    estimator=modelBAG,
    search_spaces=param_grid_BAG_bayes,
    n_iter=20,       # Number of search iterations
    cv=10,            # Number of cross-validation folds
    n_jobs=-1,        # Use all available cores
    scoring='f1', # The search maximizes this score
    refit=True # Refit the best configuration found on all the data passed to fit
)

# Run the hyperparameter search
bayes_search_BAG.fit(X_train, Y_train) #70%

bayes_best_model_BAG = bayes_search_BAG.best_estimator_

## Hard Voting

Grid Search

In [304]:
# Hyperparameters to search; the 'name__param' keys address the named estimators inside the ensemble
param_grid_VH = {
    'dt__criterion': ['gini', 'entropy'],
    'knn__n_neighbors': [1, 3],
}

grid_search_VH = GridSearchCV(modelVH, param_grid_VH, cv=10, scoring='f1', n_jobs=-1) # the search maximizes the scoring
grid_search_VH.fit(X_train, Y_train) #70%

grid_best_model_VH = grid_search_VH.best_estimator_
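The keys `dt__criterion` and `knn__n_neighbors` follow scikit-learn's `<name>__<parameter>` convention for reaching into nested estimators. A minimal sketch of that mechanism on an unfitted ensemble, using the hypothetical names `toy_voter`, `param_names`, and `updated`:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

toy_voter = VotingClassifier(
    estimators=[("dt", DecisionTreeClassifier()), ("knn", KNeighborsClassifier())],
    voting="hard",
)

# Every tunable parameter, including those of the named sub-estimators
param_names = toy_voter.get_params(deep=True)
print("dt__criterion" in param_names, "knn__n_neighbors" in param_names)

# The same names are accepted by set_params, which is what the search uses internally
toy_voter.set_params(dt__criterion="entropy", knn__n_neighbors=3)
updated = dict(toy_voter.estimators)
print(updated["dt"].criterion, updated["knn"].n_neighbors)
```

Listing `get_params()` first is a quick way to confirm a key is spelled correctly before launching a long search.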

Bayes Search

In [308]:
param_grid_VH_bayes = {
    'dt__criterion': ['gini', 'entropy'],
    'knn__n_neighbors': Integer(1,3),
}
bayes_search_VH = BayesSearchCV(
    estimator=modelVH,
    search_spaces=param_grid_VH_bayes,
    n_iter=20,       # Number of search iterations
    cv=10,            # Number of cross-validation folds
    n_jobs=-1,        # Use all available cores
    scoring='f1', # The search maximizes this score
    refit=True # Refit the best configuration found on all the data passed to fit
)

# Run the hyperparameter search
bayes_search_VH.fit(X_train, Y_train) #70%

bayes_best_model_VH = bayes_search_VH.best_estimator_



## Random Forest

Grid Search

In [306]:
# Hyperparameters to search
param_grid_RF = {
    'n_estimators': [50, 75, 100],
    'criterion': ['gini', 'entropy']
}
grid_search_RF = GridSearchCV(modelRF, param_grid_RF, cv=10, scoring='f1', n_jobs=-1) # the search maximizes the scoring
grid_search_RF.fit(X_train, Y_train) #70%

grid_best_model_RF = grid_search_RF.best_estimator_

Bayes Search

In [310]:
param_grid_RF_bayes = {
    'n_estimators': Integer(50, 100),
    'criterion': ['gini', 'entropy']
}

bayes_search_RF = BayesSearchCV(
    estimator=modelRF,
    search_spaces=param_grid_RF_bayes,
    n_iter=20,       # Number of search iterations
    cv=10,            # Number of cross-validation folds
    n_jobs=-1,        # Use all available cores
    scoring='f1', # The search maximizes this score
    refit=True # Refit the best configuration found on all the data passed to fit
)

# Run the hyperparameter search
bayes_search_RF.fit(X_train, Y_train) #70%

bayes_best_model_RF = bayes_search_RF.best_estimator_



We evaluate the models tuned by GridSearch and BayesSearch on the test data

In [314]:
f1_final_scores = pd.DataFrame(index=["F1"])

In [315]:
# Compute the test F1 score of each tuned model
Y_pred = grid_best_model_BAG.predict(X_test)
f1_final_scores["BAG_grid"] = metrics.f1_score(Y_test,Y_pred)

Y_pred = bayes_best_model_BAG.predict(X_test)
f1_final_scores["BAG_bayes"] = metrics.f1_score(Y_test,Y_pred)

Y_pred = grid_best_model_VH.predict(X_test)
f1_final_scores["VH_grid"] = metrics.f1_score(Y_test,Y_pred)

Y_pred = bayes_best_model_VH.predict(X_test)
f1_final_scores["VH_bayes"] = metrics.f1_score(Y_test,Y_pred)

Y_pred = grid_best_model_RF.predict(X_test)
f1_final_scores["RF_grid"] = metrics.f1_score(Y_test,Y_pred)

Y_pred = bayes_best_model_RF.predict(X_test)
f1_final_scores["RF_bayes"] = metrics.f1_score(Y_test,Y_pred)

In [317]:
f1_final_scores

Unnamed: 0,BAG_grid,BAG_bayes,VH_grid,VH_bayes,RF_grid,RF_bayes
F1,0.369565,0.382022,0.390093,0.385542,0.381295,0.381295


After evaluating each tuned model on the test data, the one with the best F1 score (0.390) is the Hard Voting model found by GridSearch.

## Exporting the LabelEncoder, Pipeline, and model

In [321]:
X.columns.values

array(['art_nombre', 'art_vr_reposicion', 'cco_nombre'], dtype=object)

In [324]:
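import pickle
filename = 'modeloFinal_GaviEquipos.pkl'
variables=X.columns.values
# Save the model, the preprocessing pipeline, the label encoder, and the column names together
with open(filename, 'wb') as f:
    pickle.dump([grid_best_model_VH, pipe, labelencoder, variables], f)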
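On the deployment side, the four saved objects are used in the same order as during training: build a row with the stored column names, transform it, predict, and map the code back to a label. The sketch below walks through that round trip with small stand-ins so that it runs on its own; the toy data, the labels `"bueno"` and `"malo"`, the decision tree, and the in-memory `blob` are assumptions that replace the real file and model.

```python
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Stand-in training data; the class labels are hypothetical
toy = pd.DataFrame({
    "art_nombre": ["Andamios", "Formaleta", "Andamios", "Formaleta"],
    "art_vr_reposicion": [2000, 30000, 150000, 80000],
    "cco_nombre": ["AMATISTA LIVING", "BODEGA METALMEGA", "AMATISTA LIVING", "BODEGA METALMEGA"],
})
toy_encoder = LabelEncoder()
toy_y = toy_encoder.fit_transform(["malo", "bueno", "malo", "bueno"])

toy_prep = ColumnTransformer(
    [
        ("num", MinMaxScaler(), ["art_vr_reposicion"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["art_nombre", "cco_nombre"]),
    ],
    sparse_threshold=0,  # dense output
)
toy_model = DecisionTreeClassifier(random_state=0).fit(toy_prep.fit_transform(toy), toy_y)

# Same list layout as the notebook: [model, pipeline, label encoder, column names]
blob = pickle.dumps([toy_model, toy_prep, toy_encoder, toy.columns.values])

# Deployment side: unpack, build a one-row frame, transform, predict, decode
model, prep, encoder, variables = pickle.loads(blob)
new_row = pd.DataFrame([["Andamios", 90000, "AMATISTA LIVING"]], columns=variables)
predicted_code = model.predict(prep.transform(new_row))
predicted_label = encoder.inverse_transform(predicted_code)[0]

print(predicted_label)
```

Pickled scikit-learn objects are generally meant to be loaded with the same library versions that created them, so the deployment environment should mirror the training one.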

