<img src="https://upload.wikimedia.org/wikipedia/commons/d/df/Logo_UNIR.png" width="350" height="175">

# *TFM: Comparison and optimization of Machine Learning algorithms for predicting the success of bank marketing campaigns*

Author: ***Jorge López Pérez***

***

## ***8. Handling class imbalance***

Throughout this notebook, we explore the options available for dealing with the class imbalance present in our dataset.

In [None]:
!pip install scikit-learn==1.2.2 #we pin version 1.2.2 in this notebook because later versions are incompatible with the imbalanced-learn library

In [None]:
!pip install scikeras

In [None]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone
import warnings
import time
import matplotlib.pyplot as plt
import seaborn as sns

#models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from tensorflow import keras
from sklearn.ensemble import RandomForestClassifier
import xgboost
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
from scikeras.wrappers import KerasClassifier

#metrics
from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score

#encoders
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import LabelEncoder

#imputers
from sklearn.impute import KNNImputer

#scalers
from sklearn.preprocessing import StandardScaler

#class imbalance techniques
from imblearn.under_sampling import RandomUnderSampler
from imblearn.under_sampling import TomekLinks
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTETomek

warnings.filterwarnings("ignore")

In [None]:
# read the cleaned data
data = pd.read_csv('https://raw.githubusercontent.com/JorgeLopez88/TFM/main/data/clean_data.csv')
data.shape

(41176, 21)

We split the data into the train and test sets that we will use through the end of the study (we use stratify so that train and test contain the same percentage of instances of each class):

In [None]:
x,y = data.drop(['y'], axis=1), data['y'].copy()
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.15, stratify=y, random_state=44)

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(34999, 20)
(34999,)
(6177, 20)
(6177,)
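To make the effect of `stratify` concrete, the following is a simplified, dependency-free sketch of the idea, not scikit-learn's actual implementation. The helper `stratified_split` and the toy labels are hypothetical and chosen only for illustration: each class is shuffled and split separately, so both parts end up with the same positive rate.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed=0):
    """Return (train_idx, test_idx), splitting each class separately (illustrative helper)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class share sent to test
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels (hypothetical): 90 'no' and 10 'yes'
labels = ['no'] * 90 + ['yes'] * 10
train_idx, test_idx = stratified_split(labels, test_size=0.2)

train_pos_rate = sum(labels[i] == 'yes' for i in train_idx) / len(train_idx)
test_pos_rate = sum(labels[i] == 'yes' for i in test_idx) / len(test_idx)
print(train_pos_rate, test_pos_rate)
```

Without stratification, a random split of a rare class can leave the test set with a noticeably different positive rate, which makes metrics harder to compare.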


***

#### 8.1 Pipeline

In [None]:
numericas = ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']
categoricas_nominal = ['job','marital', 'poutcome']
categoricas_ordinal = ['education', 'month', 'day_of_week']
categoricas_dicotomicas = ['default', 'housing', 'loan', 'contact']

categoricas_nominal_oh = ['job_housemaid','job_services','job_admin.','job_blue-collar','job_technician','job_retired','job_management','job_unemployed','job_self-employed','job_entrepreneur','job_student',
                          'marital_married', 'marital_single', 'marital_divorced',
                          'poutcome_nonexistent', 'poutcome_failure', 'poutcome_success']
categoricas_dicotomicas_oh = ['default_no', 'default_yes',
                              'housing_no', 'housing_yes',
                              'loan_no', 'loan_yes',
                              'contact_cellular', 'contact_telephone']

total_columns_after_transform = categoricas_ordinal + categoricas_nominal_oh + categoricas_dicotomicas_oh + numericas

def round_imputed_values(X):
    X_rounded = np.round(X)
    return X_rounded

def get_dataframe(X):
  df = pd.DataFrame(X, columns=total_columns_after_transform)
  return df

def drop_features(X):
  return X.drop(['job_housemaid', 'job_unemployed', 'job_student', 'default_no', 'default_yes'],axis=1)

def get_array(X):
  return np.array(X)

cat_ordinal_transformer = Pipeline([
    ('encoder', OrdinalEncoder(categories=[
         ["basic.4y","basic.6y","basic.9y","high.school","illiterate","professional.course","university.degree"],
         ['mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec'],
         ["mon","tue","wed","thu","fri"]]
     , handle_unknown='use_encoded_value', unknown_value=np.nan)),
    ('imputer', KNNImputer(n_neighbors=10)),
    ('rounding', FunctionTransformer(round_imputed_values))
])

cat_nominal_transformer = Pipeline([
     ('encoder_prev', OrdinalEncoder(categories=[
         ['housemaid','services','admin.','blue-collar','technician','retired','management','unemployed','self-employed','entrepreneur','student'],
         ['married', 'single', 'divorced'],
         ['nonexistent', 'failure', 'success']
         ]
     , handle_unknown='use_encoded_value', unknown_value=np.nan)),
     ('imputer', KNNImputer(n_neighbors=10)),
     ('rounding', FunctionTransformer(round_imputed_values)),
     ('encoder', OneHotEncoder())
])

cat_dicotomico_transformer = Pipeline([
     ('encoder_prev', OrdinalEncoder(categories=[
         ['no', 'yes'],
         ['no', 'yes'],
         ['no', 'yes'],
         ['cellular', 'telephone']
         ]
     , handle_unknown='use_encoded_value', unknown_value=np.nan)),
     ('imputer', KNNImputer(n_neighbors=10)),
     ('rounding', FunctionTransformer(round_imputed_values)),
     ('encoder', OneHotEncoder())
])

numericas_transformer = Pipeline([
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer([
        ('cat_ordinal', cat_ordinal_transformer, categoricas_ordinal),
        ('cat_nominal', cat_nominal_transformer, categoricas_nominal),
        ('cat_dicotomico', cat_dicotomico_transformer, categoricas_dicotomicas),
        ('numericas', numericas_transformer, numericas)
    ], remainder='passthrough')

pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('get_df', FunctionTransformer(get_dataframe)),
        ('drop_features', FunctionTransformer(drop_features)),
        ('get_array', FunctionTransformer(get_array))
    ])

In [None]:
label_encoder = LabelEncoder()
y_train_prepared = label_encoder.fit_transform(y_train)
y_test_prepared = label_encoder.transform(y_test)
print(y_train_prepared.shape)
print(y_test_prepared.shape)

(34999,)
(6177,)


In [None]:
x_train_prepared = pipeline.fit_transform(x_train)
x_test_prepared = pipeline.transform(x_test)
print(x_train_prepared.shape)
print(x_test_prepared.shape)

(34999, 33)
(6177, 33)


***

#### 8.2 Models and metric computation

We define the initial architecture of our neural network (RN). We start with a simple architecture that we will optimize later on:

In [None]:
def build_rn():
  inputs = keras.Input(shape=(x_train_prepared.shape[1],))

  hidden1 = keras.layers.Dense(32, activation='relu')(inputs)
  hidden2 = keras.layers.Dense(16, activation='relu')(hidden1)
  outputs = keras.layers.Dense(1, activation='sigmoid')(hidden2)

  model = keras.Model(inputs=inputs, outputs=outputs)
  model.compile(optimizer='adam', loss='binary_crossentropy')
  return model

In [None]:
def initialize_models(class_weights_balanced=False):
  if class_weights_balanced:
    estimators = [
        ('lr', LogisticRegression(random_state=44, class_weight='balanced')),
        ('dt', DecisionTreeClassifier(random_state=44, class_weight='balanced')),
        ('kn', KNeighborsClassifier()),
        ('rn', KerasClassifier(model=build_rn, epochs=10, batch_size=32, verbose=0, random_state=44, class_weight='balanced')),
        ('rf', RandomForestClassifier(random_state=44, class_weight='balanced')),
        ('xgb', xgboost.XGBClassifier(random_state=44, scale_pos_weight=7.88)), #scale_pos_weight = sum(negatives) / sum(positives)
        ('hist', HistGradientBoostingClassifier(random_state=44, class_weight='balanced'))
    ]
  else:
    estimators = [
        ('lr', LogisticRegression(random_state=44)),
        ('dt', DecisionTreeClassifier(random_state=44)),
        ('kn', KNeighborsClassifier()),
        ('rn', KerasClassifier(model=build_rn, epochs=10, batch_size=32, verbose=0, random_state=44)),
        ('rf', RandomForestClassifier(random_state=44)),
        ('xgb', xgboost.XGBClassifier(random_state=44)),
        ('hist', HistGradientBoostingClassifier(random_state=44))
    ]

  dict_estimators = dict(estimators)

  voting = VotingClassifier(estimators=estimators, voting='soft')
  dict_estimators['voting'] = voting

  return dict_estimators
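The value `scale_pos_weight=7.88` is hard-coded above, following the ratio given in the inline comment. A minimal sketch of how such a ratio could be derived from a label vector is shown below. The helper `compute_scale_pos_weight` and the toy counts are hypothetical (they are chosen to reproduce 7.88 and are not the dataset's actual class counts).

```python
from collections import Counter

def compute_scale_pos_weight(labels, positive=1):
    """Ratio of negative to positive examples: sum(negatives) / sum(positives)."""
    counts = Counter(labels)
    n_pos = counts[positive]
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("labels contain no positive examples")
    return n_neg / n_pos

# Hypothetical toy labels: 788 negatives and 100 positives
toy_labels = [0] * 788 + [1] * 100
ratio = compute_scale_pos_weight(toy_labels)
print(round(ratio, 2))
```

Computing the ratio from the encoded training labels instead of hard-coding it would keep the value in sync if the split or the data changed.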

Functions for computing the metrics:

In [None]:
def probas_to_abs(probas, umbral=0.5):
  #convert predicted probabilities into hard 0/1 labels using the threshold (umbral)
  return [1 if proba >= umbral else 0 for proba in probas]

def calculate_cv_metrics(estimator,x,y, sampling_method):
  skf = StratifiedKFold(n_splits=5, shuffle=True,random_state=44)

  scores_auc_pr = []
  scores_precision = []
  scores_recall = []
  scores_f1 = []
  scores_accuracy = []
  scores_roc = []

  for i, (train_idx, test_idx) in enumerate(skf.split(x, y)):

    X_train, X_test = x[train_idx], x[test_idx]
    Y_train, Y_test = y[train_idx], y[test_idx]

    if sampling_method is not None:
      X_train, Y_train = sampling_method.fit_resample(X_train, Y_train) #apply the sampling technique ONLY to the training fold

    model = clone(estimator)
    model.fit(X_train, Y_train) #fit the model on the training data (resampled if a sampling_method was passed)
    preds_test = model.predict_proba(X_test)[:,1]

    scores_auc_pr.append(average_precision_score(Y_test, preds_test))
    scores_precision.append(precision_score(Y_test, probas_to_abs(preds_test)))
    scores_recall.append(recall_score(Y_test, probas_to_abs(preds_test)))
    scores_f1.append(f1_score(Y_test, probas_to_abs(preds_test)))
    scores_accuracy.append(accuracy_score(Y_test, probas_to_abs(preds_test)))
    scores_roc.append(roc_auc_score(Y_test, preds_test))

  mean_auc_pr = round(np.mean(scores_auc_pr), 4)
  mean_precision = round(np.mean(scores_precision), 4)
  mean_recall = round(np.mean(scores_recall), 4)
  mean_f1 = round(np.mean(scores_f1), 4)
  mean_accuracy = round(np.mean(scores_accuracy), 4)
  mean_roc = round(np.mean(scores_roc), 4)

  return mean_auc_pr, mean_precision, mean_recall, mean_f1, mean_accuracy, mean_roc


def calculate_pr_cv(sampling_method, class_weights_balanced=False):
  if class_weights_balanced:
    modelos = initialize_models(class_weights_balanced=True)
  else:
    modelos = initialize_models()

  scores = []
  for key, model in modelos.items():
      score,_,_,_,_,_ = calculate_cv_metrics(model, x_train_prepared, y_train_prepared, sampling_method=sampling_method)
      scores.append(score)
      print(f'CV -> AUC PR score for {key}: ', score)
  print('')
  print('Mean AUC PR score: ', round(np.mean(scores), 4))


def calculate_baseline_metrics_cv(sampling_method, class_weights_balanced=False):
  if class_weights_balanced:
    modelos = initialize_models(class_weights_balanced=True)
  else:
    modelos = initialize_models()

  for key, model in modelos.items():
      auc_pr, precision, recall, f1,_,_ = calculate_cv_metrics(model, x_train_prepared, y_train_prepared, sampling_method=sampling_method)
      print(f'CV -> AUC PR score for {key}: ', auc_pr)
      print(f'CV -> Precision score for {key}: ', precision)
      print(f'CV -> Recall score for {key}: ', recall)
      print(f'CV -> F1 score for {key}: ', f1)
      print('')
      print('**********************************')
      print('')
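The main metric reported throughout is AUC PR, computed with `average_precision_score`. As a rough illustration of what that number summarizes, here is a simplified pure-Python sketch of average precision. The function `average_precision` is hypothetical and assumes there are no tied scores, whereas the library handles ties by grouping thresholds.

```python
def average_precision(y_true, scores):
    """AP = sum over ranked positives of (recall increment) * (precision at that rank).

    Simplified sketch: assumes binary 0/1 labels and no tied scores.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("y_true contains no positive examples")

    # Rank examples from highest to lowest score
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    true_positives = 0
    ap = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_positives += 1
            precision_at_rank = true_positives / rank
            ap += precision_at_rank / n_pos  # each positive raises recall by 1 / n_pos
    return ap

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
ap = average_precision(y_true, scores)
print(round(ap, 4))
```

Because the score depends only on how the positives are ranked, it does not involve the 0.5 threshold used for precision, recall and F1.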

***

#### 8.3 Options for handling class imbalance

##### 8.3.1 Under-sampling techniques

***OPTION 1: Random Under Sampling***

In [None]:
calculate_pr_cv(sampling_method=RandomUnderSampler(random_state=44))

CV -> AUC PR score for lr:  0.577
CV -> AUC PR score for dt:  0.3465
CV -> AUC PR score for kn:  0.4573
CV -> AUC PR score for rn:  0.6113
CV -> AUC PR score for rf:  0.6095
CV -> AUC PR score for xgb:  0.616
CV -> AUC PR score for hist:  0.6303
CV -> AUC PR score for voting:  0.6379

Mean AUC PR score:  0.5607
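As an illustration of the idea behind random under-sampling, the sketch below discards randomly chosen examples until every class has as many examples as the smallest one. It is a simplified, dependency-free version: the helper `random_under_sample` and the toy data are hypothetical, and it is not the `RandomUnderSampler` implementation.

```python
import random
from collections import defaultdict

def random_under_sample(X, y, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)

    n_min = min(len(indices) for indices in by_class.values())
    keep = []
    for indices in by_class.values():
        keep.extend(rng.sample(indices, n_min))  # sample without replacement
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data (hypothetical): 8 negatives and 2 positives
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
X_res, y_res = random_under_sample(X, y)
print(y_res.count(0), y_res.count(1))
```

The price of the balanced result is that most majority-class examples are thrown away, which may help explain why this option scores lowest on average here.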


***OPTION 2: TomekLinks***

In [None]:
calculate_pr_cv(sampling_method=TomekLinks())

CV -> AUC PR score for lr:  0.5884
CV -> AUC PR score for dt:  0.327
CV -> AUC PR score for kn:  0.4804
CV -> AUC PR score for rn:  0.6335
CV -> AUC PR score for rf:  0.6416
CV -> AUC PR score for xgb:  0.6461
CV -> AUC PR score for hist:  0.6671
CV -> AUC PR score for voting:  0.6543

Mean AUC PR score:  0.5798
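A Tomek link is a pair of examples from different classes that are each other's nearest neighbor. The sketch below finds such pairs by brute force on toy one-dimensional data. The function `find_tomek_links` and the data are hypothetical, and dropping only the majority-class member is assumed to match the library's default behavior.

```python
def find_tomek_links(X, y):
    """Return pairs (i, j), i < j, of mutual nearest neighbors with different labels."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    def nearest(i):
        # Index of the closest other point (brute force)
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: sq_dist(X[i], X[j]))

    nn = [nearest(i) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]

# Toy data (hypothetical): class 0 is the majority class
X = [[0.0], [1.2], [2.0], [2.4], [5.0], [5.5]]
y = [0, 0, 0, 1, 1, 1]
links = find_tomek_links(X, y)

# Assumed default: drop only the majority-class member of each link
majority = 0
to_drop = {i for pair in links for i in pair if y[i] == majority}
print(links, to_drop)
```

Only borderline majority examples are removed, so the class ratio barely changes. This is a cleaning step rather than a rebalancing one.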


##### 8.3.2 Over-sampling techniques

***OPTION 3: Random Over Sampling***

In [None]:
calculate_pr_cv(sampling_method=RandomOverSampler(random_state=44))

CV -> AUC PR score for lr:  0.5768
CV -> AUC PR score for dt:  0.3182
CV -> AUC PR score for kn:  0.4034
CV -> AUC PR score for rn:  0.622
CV -> AUC PR score for rf:  0.6326
CV -> AUC PR score for xgb:  0.6398
CV -> AUC PR score for hist:  0.6672
CV -> AUC PR score for voting:  0.6457

Mean AUC PR score:  0.5632


***OPTION 4: SMOTE***

In [None]:
calculate_pr_cv(sampling_method=SMOTE(random_state=44))

CV -> AUC PR score for lr:  0.5743
CV -> AUC PR score for dt:  0.3272
CV -> AUC PR score for kn:  0.4264
CV -> AUC PR score for rn:  0.6114
CV -> AUC PR score for rf:  0.6214
CV -> AUC PR score for xgb:  0.6382
CV -> AUC PR score for hist:  0.6544
CV -> AUC PR score for voting:  0.642

Mean AUC PR score:  0.5619
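The core step of SMOTE is to create a synthetic minority example by interpolating between a minority example and one of its minority-class neighbors. The sketch below shows only that interpolation step. The helper `smote_point` and the two points are hypothetical, and the neighbor search (k nearest minority neighbors) is left out.

```python
import random

def smote_point(x, neighbor, rng):
    """Return a point on the segment between x and neighbor: x + lam * (neighbor - x)."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]

rng = random.Random(44)
x = [1.0, 2.0]          # a minority example (toy values)
neighbor = [3.0, 6.0]   # one of its minority-class neighbors (toy values)
synthetic = smote_point(x, neighbor, rng)
print(synthetic)
```

Unlike random over-sampling, which duplicates existing rows, this produces new points inside the region already occupied by the minority class.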


##### 8.3.3 Combined over- and under-sampling techniques

***OPTION 5: SMOTETomek***

In [None]:
calculate_pr_cv(sampling_method=SMOTETomek(random_state=44))

CV -> AUC PR score for lr:  0.5741
CV -> AUC PR score for dt:  0.3284
CV -> AUC PR score for kn:  0.4262
CV -> AUC PR score for rn:  0.616
CV -> AUC PR score for rf:  0.6199
CV -> AUC PR score for xgb:  0.6365
CV -> AUC PR score for hist:  0.6534
CV -> AUC PR score for voting:  0.6408

Mean AUC PR score:  0.5619


##### 8.3.4 Algorithm-level approaches

***OPTION 6: Balanced class weights***

In [None]:
calculate_pr_cv(sampling_method=None, class_weights_balanced=True)

CV -> AUC PR score for lr:  0.5771
CV -> AUC PR score for dt:  0.3091
CV -> AUC PR score for kn:  0.478
CV -> AUC PR score for rn:  0.6247
CV -> AUC PR score for rf:  0.6416
CV -> AUC PR score for xgb:  0.6394
CV -> AUC PR score for hist:  0.6667
CV -> AUC PR score for voting:  0.6508

Mean AUC PR score:  0.5734
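With `class_weight='balanced'`, scikit-learn weights each class inversely to its frequency, using `n_samples / (n_classes * count)`. The sketch below reproduces that formula in plain Python on toy labels. The helper `balanced_class_weights` is hypothetical and is shown only to make the weighting explicit.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * number of examples in that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Toy labels (hypothetical): 8 negatives and 2 positives
weights = balanced_class_weights([0] * 8 + [1] * 2)
print(weights)
```

The rare class gets the larger weight, so its errors count for more in the loss, and no rows are added or removed.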


***

#### 8.4 CV metrics for the best option (OPTION 2)


In [None]:
calculate_baseline_metrics_cv(sampling_method=TomekLinks())

CV -> AUC PR score for lr:  0.5884
CV -> Precision score for lr:  0.6475
CV -> Recall score for lr:  0.4545
CV -> F1 score for lr:  0.5335

**********************************

CV -> AUC PR score for dt:  0.327
CV -> Precision score for dt:  0.5003
CV -> Recall score for dt:  0.5526
CV -> F1 score for dt:  0.5251

**********************************

CV -> AUC PR score for kn:  0.4804
CV -> Precision score for kn:  0.5747
CV -> Recall score for kn:  0.4748
CV -> F1 score for kn:  0.5198

**********************************

CV -> AUC PR score for rn:  0.6335
CV -> Precision score for rn:  0.6191
CV -> Recall score for rn:  0.545
CV -> F1 score for rn:  0.5767

**********************************

CV -> AUC PR score for rf:  0.6416
CV -> Precision score for rf:  0.6281
CV -> Recall score for rf:  0.5727
CV -> F1 score for rf:  0.5989

**********************************

CV -> AUC PR score for xgb:  0.6461
CV -> Precision score for xgb:  0.6226
CV -> Recall score for xgb:  0.6008
CV -> F1 sc

***

We save the final transformed data for later sections (we will not apply any of the class imbalance treatments covered above, as explained in the thesis report):

In [None]:
pd.DataFrame(x_train_prepared).to_csv('x_train_prepared.csv', index=False)
pd.DataFrame(y_train_prepared).to_csv('y_train_prepared.csv', index=False)
pd.DataFrame(x_test_prepared).to_csv('x_test_prepared.csv', index=False)
pd.DataFrame(y_test_prepared).to_csv('y_test_prepared.csv', index=False)