# Stacking Classification

In this notebook we use a `Stacking` ensemble to make predictions on the Kaggle bank transactions dataset. To do so, we import `StackingClassifier` from sklearn, along with several basic classifiers and `GridSearchCV`, which tunes the models' hyperparameters through cross-validation.
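
As a minimal sketch of how these two pieces fit together, the example below tunes a small stacking ensemble on synthetic data. The toy dataset, the names `X_demo`, `y_demo`, `stack` and `search`, and the grid values are assumptions made for illustration; they are not the data or settings used in the rest of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic toy data (an assumption; not the Kaggle transactions).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The names given here ('knn', 'svm') are the prefixes used in the grid.
base_estimators = [('knn', KNeighborsClassifier()), ('svm', SVC(random_state=42))]
stack = StackingClassifier(estimators=base_estimators, final_estimator=LogisticRegression())

# '<name>__<param>' reaches a base estimator;
# 'final_estimator__<param>' reaches the meta-learner.
param_grid = {
    'knn__n_neighbors': [3, 5],
    'svm__C': [0.1, 1],
    'final_estimator__C': [0.1, 1.0],
}

search = GridSearchCV(stack, param_grid=param_grid, cv=3, scoring='f1')
search.fit(X_tr, y_tr)

print(search.best_params_)
print(f"Hold-out F1: {search.score(X_te, y_te):.2f}")
```

The grids used later in this notebook only tune the base estimators; the `final_estimator__` prefix shown here is one way the meta-learner could be tuned in the same search.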

## Loading the data

In [88]:
import pandas as pd

data_path = './data/'

train_data = pd.read_csv(f'{data_path}train_data.csv')
test_data = pd.read_csv(f'{data_path}test_data.csv')

df_reduce_mrmr = pd.read_csv(f'{data_path}X_train_reduce_mrmr.csv')
df_reduce_mrmr_instances = pd.read_csv(f'{data_path}df_reduce_mrmr_instances.csv')
df_reduce_mrmr_instances_hard = pd.read_csv(f'{data_path}df_reduce_mrmr_instances_hard.csv')

df_X_train_reduce_RFC = pd.read_csv(f'{data_path}df_X_train_reduce_RFC.csv')
df_reduce_RFC_instances = pd.read_csv(f'{data_path}df_reduce_RFC_instances.csv')
df_reduce_RFC_instances_hard = pd.read_csv(f'{data_path}df_reduce_RFC_instances_hard.csv')

print("Datos cargados exitosamente:")
print(f"train_data: {train_data.shape}")
print(train_data["Class"].value_counts())
print(f"df_reduce_mrmr: {df_reduce_mrmr.shape}")
print(df_reduce_mrmr["Class"].value_counts())
print(f"df_reduce_mrmr_instances: {df_reduce_mrmr_instances.shape}")
print(df_reduce_mrmr_instances["Class"].value_counts())
print(f"df_reduce_mrmr_instances hard: {df_reduce_mrmr_instances_hard.shape}")
print(df_reduce_mrmr_instances_hard["Class"].value_counts())
print(f"df_X_train_reduce_RFC: {df_X_train_reduce_RFC.shape}")
print(df_X_train_reduce_RFC["Class"].value_counts())
print(f"df_reduce_RFC_instances: {df_reduce_RFC_instances.shape}")
print(df_reduce_RFC_instances["Class"].value_counts())
print(f"df_reduce_RFC_instances hard: {df_reduce_RFC_instances_hard.shape}")
print(df_reduce_RFC_instances_hard["Class"].value_counts())

Datos cargados exitosamente:
train_data: (256326, 31)
Class
0    255883
1       443
Name: count, dtype: int64
df_reduce_mrmr: (256326, 11)
Class
0    255883
1       443
Name: count, dtype: int64
df_reduce_mrmr_instances: (886, 11)
Class
0    443
1    443
Name: count, dtype: int64
df_reduce_mrmr_instances hard: (886, 11)
Class
0    443
1    443
Name: count, dtype: int64
df_X_train_reduce_RFC: (256326, 11)
Class
0    255883
1       443
Name: count, dtype: int64
df_reduce_RFC_instances: (886, 11)
Class
0    443
1    443
Name: count, dtype: int64
df_reduce_RFC_instances hard: (886, 11)
Class
0    443
1    443
Name: count, dtype: int64


## mRMR + clusterCentroids_hard

### Final estimator: LogisticRegression

In [91]:
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score, classification_report
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier

X = df_reduce_mrmr_instances_hard.drop(columns=['Class'])
y = df_reduce_mrmr_instances_hard['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Dimensiones de los conjuntos:")
print(f"Conjunto de entrenamiento: {X_train.shape}, {y_train.shape}")
print(f"Conjunto de prueba: {X_test.shape}, {y_test.shape}")

estimadores = [('knn', KNeighborsClassifier()),
               ('svm', SVC(random_state = 42, class_weight = 'balanced'))]

sclf = StackingClassifier(estimators = estimadores , final_estimator = LogisticRegression())

parametros = {'knn__n_neighbors': [5],
              'svm__C': [100], 'svm__kernel': ['poly'], 'svm__degree': [2]}

grid = GridSearchCV(estimator = sclf, param_grid = parametros, cv=5, scoring='f1', verbose=3)
grid.fit(X_train, y_train)

y_pred = grid.best_estimator_.predict(X_test)
print(f"Mejores parámetros: {grid.best_params_}")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")

Dimensiones de los conjuntos:
Conjunto de entrenamiento: (708, 10), (708,)
Conjunto de prueba: (178, 10), (178,)
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV 1/5] END knn__n_neighbors=5, svm__C=100, svm__degree=2, svm__kernel=poly;, score=0.885 total time=   0.2s
[CV 2/5] END knn__n_neighbors=5, svm__C=100, svm__degree=2, svm__kernel=poly;, score=0.889 total time=   0.1s
[CV 3/5] END knn__n_neighbors=5, svm__C=100, svm__degree=2, svm__kernel=poly;, score=0.866 total time=   0.1s
[CV 4/5] END knn__n_neighbors=5, svm__C=100, svm__degree=2, svm__kernel=poly;, score=0.891 total time=   0.1s
[CV 5/5] END knn__n_neighbors=5, svm__C=100, svm__degree=2, svm__kernel=poly;, score=0.894 total time=   0.1s
Mejores parámetros: {'knn__n_neighbors': 5, 'svm__C': 100, 'svm__degree': 2, 'svm__kernel': 'poly'}
Accuracy: 0.90


### Validation on the test set

In [92]:
from sklearn.preprocessing import MinMaxScaler

# Take the test data and keep only the features selected by mRMR
X_test = test_data.drop(columns=['Class'])
y_test_final = test_data['Class']
columns_to_keep_mrmr = ['V17', 'Time', 'Amount', 'V25', 'V20', 'V7', 'V13', 'V22', 'V19', 'V23']
# .copy() yields an independent frame, so the column assignments below do not raise SettingWithCopyWarning
X_test_reduce = X_test[columns_to_keep_mrmr].copy()

# Normalize 'Amount' and 'Time' to [0, 1]
# Note: the scaler is fit on the test set itself; reusing a scaler fit on the training data would be preferable
scaler = MinMaxScaler()
X_test_reduce['Amount'] = scaler.fit_transform(X_test_reduce[['Amount']])
X_test_reduce['Time'] = scaler.fit_transform(X_test_reduce[['Time']])

print(y_test_final.value_counts())

# Predict on the test set
y_pred = grid.best_estimator_.predict(X_test_reduce)

# Compute the confusion matrix and the classification report
conf_matrix = confusion_matrix(y_test_final, y_pred)
report = classification_report(y_test_final, y_pred, target_names=['Correctas', 'Fraudulentas'])
# Show the confusion matrix
print("Matriz de confusión:")
print(conf_matrix)

print("Reporte de Clasificación:")
print(report)

Class
0    28432
1       49
Name: count, dtype: int64
Matriz de confusión:
[[27854   578]
 [    6    43]]
Reporte de Clasificación:
              precision    recall  f1-score   support

   Correctas       1.00      0.98      0.99     28432
Fraudulentas       0.07      0.88      0.13        49

    accuracy                           0.98     28481
   macro avg       0.53      0.93      0.56     28481
weighted avg       1.00      0.98      0.99     28481
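
The report above shows 0.98 accuracy alongside 0.07 precision for the fraud class. The short calculation below, a sketch that only reuses the confusion matrix printed above, recomputes those figures by hand and compares the accuracy with that of a trivial model that labels every transaction as legitimate. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above (rows: true class, columns: predicted class).
conf = np.array([[27854, 578],
                 [6, 43]])
tn, fp, fn, tp = conf.ravel()

accuracy = (tp + tn) / conf.sum()
precision_fraud = tp / (tp + fp)   # share of fraud alerts that are real fraud
recall_fraud = tp / (tp + fn)      # share of real fraud that is caught
f1_fraud = 2 * precision_fraud * recall_fraud / (precision_fraud + recall_fraud)

# Accuracy of a trivial model that predicts class 0 for every transaction.
baseline_accuracy = (tn + fp) / conf.sum()

print(f"accuracy={accuracy:.4f}  baseline accuracy={baseline_accuracy:.4f}")
print(f"fraud precision={precision_fraud:.4f}  recall={recall_fraud:.4f}  f1={f1_fraud:.4f}")
```

Because only 49 of the 28481 test transactions are fraudulent, the trivial baseline beats the stacked model on accuracy while catching no fraud at all, which is why the fraud-class precision, recall and F1 are the figures worth comparing between configurations.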



### Final estimator: KNN

In [94]:
X = df_reduce_mrmr_instances_hard.drop(columns=['Class'])
y = df_reduce_mrmr_instances_hard['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

estimadores = [('knn', KNeighborsClassifier()),
               ('svm', SVC(random_state = 42, class_weight = 'balanced'))]

sclf = StackingClassifier(estimators = estimadores , final_estimator = KNeighborsClassifier())

parametros = {'knn__n_neighbors': [3,5,7],
              'svm__C': [0.1,1,100], 'svm__kernel': ['poly','linear'], 'svm__degree': [2,3]}

grid = GridSearchCV(estimator = sclf, param_grid = parametros, cv=5, scoring='f1', verbose=3)
grid.fit(X_train, y_train)

y_pred = grid.best_estimator_.predict(X_test)

print(f"Mejores parámetros: {grid.best_params_}")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")

# Take the test data and keep only the features selected by mRMR
X_test = test_data.drop(columns=['Class'])
y_test_final = test_data['Class']
columns_to_keep_mrmr = ['V17', 'Time', 'Amount', 'V25', 'V20', 'V7', 'V13', 'V22', 'V19', 'V23']
# .copy() yields an independent frame, so the column assignments below do not raise SettingWithCopyWarning
X_test_reduce = X_test[columns_to_keep_mrmr].copy()

# Normalize 'Amount' and 'Time' to [0, 1]
# Note: the scaler is fit on the test set itself; reusing a scaler fit on the training data would be preferable
scaler = MinMaxScaler()
X_test_reduce['Amount'] = scaler.fit_transform(X_test_reduce[['Amount']])
X_test_reduce['Time'] = scaler.fit_transform(X_test_reduce[['Time']])

# Predict on the test set
y_pred = grid.best_estimator_.predict(X_test_reduce)
# Compute the confusion matrix and the classification report
conf_matrix = confusion_matrix(y_test_final, y_pred)
report = classification_report(y_test_final, y_pred, target_names=['Correctas', 'Fraudulentas'])
# Show the confusion matrix
print("Matriz de confusión:")
print(conf_matrix)

print("Reporte de Clasificación:")
print(report)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
[CV 1/5] END knn__n_neighbors=3, svm__C=0.1, svm__degree=2, svm__kernel=poly;, score=0.899 total time=   0.1s
[CV 2/5] END knn__n_neighbors=3, svm__C=0.1, svm__degree=2, svm__kernel=poly;, score=0.903 total time=   0.1s
[CV 3/5] END knn__n_neighbors=3, svm__C=0.1, svm__degree=2, svm__kernel=poly;, score=0.861 total time=   0.1s
[CV 4/5] END knn__n_neighbors=3, svm__C=0.1, svm__degree=2, svm__kernel=poly;, score=0.847 total time=   0.1s
[CV 5/5] END knn__n_neighbors=3, svm__C=0.1, svm__degree=2, svm__kernel=poly;, score=0.900 total time=   0.1s
[CV 1/5] END knn__n_neighbors=3, svm__C=0.1, svm__degree=2, svm__kernel=linear;, score=0.881 total time=   0.1s
[CV 2/5] END knn__n_neighbors=3, svm__C=0.1, svm__degree=2, svm__kernel=linear;, score=0.899 total time=   0.1s
[CV 3/5] END knn__n_neighbors=3, svm__C=0.1, svm__degree=2, svm__kernel=linear;, score=0.855 total time=   0.1s
[CV 4/5] END knn__n_neighbors=3, svm__C=0.1, svm__de

Matriz de confusión:
[[26110  2322]
 [    4    45]]
Reporte de Clasificación:
              precision    recall  f1-score   support

   Correctas       1.00      0.92      0.96     28432
Fraudulentas       0.02      0.92      0.04        49

    accuracy                           0.92     28481
   macro avg       0.51      0.92      0.50     28481
weighted avg       1.00      0.92      0.96     28481
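
The mRMR test cells fit a fresh `MinMaxScaler` on the test data. A common alternative is to fit the scaler once on the training data and reuse it, so that both sets are mapped with the same minimum and maximum. The sketch below illustrates that pattern on tiny hypothetical frames (`train_raw`, `test_raw`); it assumes the training `Amount` and `Time` columns were min-max scaled during preprocessing, a step this notebook does not show.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-ins for the raw training and test tables.
train_raw = pd.DataFrame({'Amount': [0.0, 50.0, 200.0], 'Time': [0.0, 1000.0, 4000.0]})
test_raw = pd.DataFrame({'Amount': [100.0, 400.0], 'Time': [2000.0, 3000.0]})

cols = ['Amount', 'Time']

# Fit once, on the training data only.
scaler = MinMaxScaler().fit(train_raw[cols])

# .copy() gives an independent frame, so the assignment below does not
# trigger pandas' SettingWithCopyWarning.
test_scaled = test_raw.copy()
test_scaled[cols] = scaler.transform(test_raw[cols])

# Test values outside the training range map outside [0, 1];
# that is expected when the training statistics are reused.
print(test_scaled)
```

In a real pipeline the fitted scaler would typically be saved with the preprocessing artifacts (or placed inside a `Pipeline`) so the test cells can load it instead of refitting.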



### Final estimator: decision tree

In [95]:
from sklearn.tree import DecisionTreeClassifier

X = df_reduce_mrmr_instances_hard.drop(columns=['Class'])
y = df_reduce_mrmr_instances_hard['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

estimadores = [('knn', KNeighborsClassifier()),
               ('svm', SVC(random_state = 42, class_weight = 'balanced'))]

sclf = StackingClassifier(estimators = estimadores , final_estimator = DecisionTreeClassifier(random_state = 42))

parametros = {'knn__n_neighbors': [3,5,7],
              'svm__C': [0.1,1,100], 'svm__kernel': ['poly','linear'], 'svm__degree': [2,3]}

grid = GridSearchCV(estimator = sclf, param_grid = parametros, cv=5, scoring='f1', verbose=3)
grid.fit(X_train, y_train)

y_pred = grid.best_estimator_.predict(X_test)

print(f"Mejores parámetros: {grid.best_params_}")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")

# Take the test data and keep only the features selected by mRMR
X_test = test_data.drop(columns=['Class'])
y_test_final = test_data['Class']
columns_to_keep_mrmr = ['V17', 'Time', 'Amount', 'V25', 'V20', 'V7', 'V13', 'V22', 'V19', 'V23']
# .copy() yields an independent frame, so the column assignments below do not raise SettingWithCopyWarning
X_test_reduce = X_test[columns_to_keep_mrmr].copy()

# Normalize 'Amount' and 'Time' to [0, 1]
# Note: the scaler is fit on the test set itself; reusing a scaler fit on the training data would be preferable
scaler = MinMaxScaler()
X_test_reduce['Amount'] = scaler.fit_transform(X_test_reduce[['Amount']])
X_test_reduce['Time'] = scaler.fit_transform(X_test_reduce[['Time']])

# Predict on the test set
y_pred = grid.best_estimator_.predict(X_test_reduce)
# Compute the confusion matrix and the classification report
conf_matrix = confusion_matrix(y_test_final, y_pred)
report = classification_report(y_test_final, y_pred, target_names=['Correctas', 'Fraudulentas'])
# Show the confusion matrix
print("Matriz de confusión:")
print(conf_matrix)

print("Reporte de Clasificación:")
print(report)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
[CV 1/5] END knn__n_neighbors=3, svm__C=0.1, svm__degree=2, svm__kernel=poly;, score=0.861 total time=   0.1s
[CV 2/5] END knn__n_neighbors=3, svm__C=0.1, svm__degree=2, svm__kernel=poly;, score=0.875 total time=   0.1s
[CV 3/5] END knn__n_neighbors=3, svm__C=0.1, svm__degree=2, svm__kernel=poly;, score=0.797 total time=   0.1s
[CV 4/5] END knn__n_neighbors=3, svm__C=0.1, svm__degree=2, svm__kernel=poly;, score=0.795 total time=   0.1s
[CV 5/5] END knn__n_neighbors=3, svm__C=0.1, svm__degree=2, svm__kernel=poly;, score=0.824 total time=   0.1s
[CV 1/5] END knn__n_neighbors=3, svm__C=0.1, svm__degree=2, svm__kernel=linear;, score=0.824 total time=   0.1s
[CV 2/5] END knn__n_neighbors=3, svm__C=0.1, svm__degree=2, svm__kernel=linear;, score=0.842 total time=   0.1s
[CV 3/5] END knn__n_neighbors=3, svm__C=0.1, svm__degree=2, svm__kernel=linear;, score=0.814 total time=   0.1s
[CV 4/5] END knn__n_neighbors=3, svm__C=0.1, svm__de

Matriz de confusión:
[[23424  5008]
 [    5    44]]
Reporte de Clasificación:
              precision    recall  f1-score   support

   Correctas       1.00      0.82      0.90     28432
Fraudulentas       0.01      0.90      0.02        49

    accuracy                           0.82     28481
   macro avg       0.50      0.86      0.46     28481
weighted avg       1.00      0.82      0.90     28481



## RFC + clusterCentroids_hard

### Final estimator: LogisticRegression

In [97]:
X = df_reduce_RFC_instances_hard.drop(columns=['Class'])
y = df_reduce_RFC_instances_hard['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Dimensiones de los conjuntos:")
print(f"Conjunto de entrenamiento: {X_train.shape}, {y_train.shape}")
print(f"Conjunto de prueba: {X_test.shape}, {y_test.shape}")

estimadores = [('knn', KNeighborsClassifier()),
               ('svm', SVC(random_state = 42, class_weight = 'balanced')),
               ('dt', DecisionTreeClassifier(class_weight='balanced', random_state=42))]

sclf = StackingClassifier(estimators = estimadores , final_estimator = LogisticRegression())

parametros = {'knn__n_neighbors': [3,5,7],
              'svm__C': [0.1,1,100], 'svm__kernel': ['poly','linear'], 'svm__degree': [2,3],
              'dt__max_depth' : [5,7],'dt__min_samples_split': [2], 'dt__min_samples_leaf': [4]}

grid = GridSearchCV(estimator = sclf, param_grid = parametros, cv=5, scoring='f1', verbose=3)
grid.fit(X_train, y_train)

y_pred = grid.best_estimator_.predict(X_test)
print(f"Mejores parámetros: {grid.best_params_}")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")

Dimensiones de los conjuntos:
Conjunto de entrenamiento: (708, 10), (708,)
Conjunto de prueba: (178, 10), (178,)
Fitting 5 folds for each of 72 candidates, totalling 360 fits
[CV 1/5] END dt__max_depth=5, dt__min_samples_leaf=4, dt__min_samples_split=2, knn__n_neighbors=3, svm__C=0.1, svm__degree=2, svm__kernel=poly;, score=0.913 total time=   0.1s
[CV 2/5] END dt__max_depth=5, dt__min_samples_leaf=4, dt__min_samples_split=2, knn__n_neighbors=3, svm__C=0.1, svm__degree=2, svm__kernel=poly;, score=0.906 total time=   0.1s
[CV 3/5] END dt__max_depth=5, dt__min_samples_leaf=4, dt__min_samples_split=2, knn__n_neighbors=3, svm__C=0.1, svm__degree=2, svm__kernel=poly;, score=0.904 total time=   0.1s
[CV 4/5] END dt__max_depth=5, dt__min_samples_leaf=4, dt__min_samples_split=2, knn__n_neighbors=3, svm__C=0.1, svm__degree=2, svm__kernel=poly;, score=0.928 total time=   0.1s
[CV 5/5] END dt__max_depth=5, dt__min_samples_leaf=4, dt__min_samples_split=2, knn__n_neighbors=3, svm__C=0.1, svm__degre

### Validation on the test set

In [99]:
# Take the test data and keep only the features selected by RFC
X_test = test_data.drop(columns=['Class'])
y_test_final = test_data['Class']
columns_to_keep_RFC = ['V17', 'V16', 'V12', 'V14', 'V11', 'V10', 'V9', 'V4', 'V18', 'V7']
X_test_reduce = X_test[columns_to_keep_RFC]

print(y_test_final.value_counts())

# Predict on the test set
y_pred = grid.best_estimator_.predict(X_test_reduce)

# Compute the confusion matrix and the classification report
conf_matrix = confusion_matrix(y_test_final, y_pred)
report = classification_report(y_test_final, y_pred, target_names=['Correctas', 'Fraudulentas'])
# Show the confusion matrix
print("Matriz de confusión:")
print(conf_matrix)

print("Reporte de Clasificación:")
print(report)

Class
0    28432
1       49
Name: count, dtype: int64
Matriz de confusión:
[[28104   328]
 [    5    44]]
Reporte de Clasificación:
              precision    recall  f1-score   support

   Correctas       1.00      0.99      0.99     28432
Fraudulentas       0.12      0.90      0.21        49

    accuracy                           0.99     28481
   macro avg       0.56      0.94      0.60     28481
weighted avg       1.00      0.99      0.99     28481



### Final estimator: KNN

In [100]:
X = df_reduce_RFC_instances_hard.drop(columns=['Class'])
y = df_reduce_RFC_instances_hard['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Dimensiones de los conjuntos:")
print(f"Conjunto de entrenamiento: {X_train.shape}, {y_train.shape}")
print(f"Conjunto de prueba: {X_test.shape}, {y_test.shape}")

estimadores = [('knn', KNeighborsClassifier()),
               ('svm', SVC(random_state = 42, class_weight = 'balanced')),
               ('dt', DecisionTreeClassifier(class_weight='balanced', random_state=42))]

sclf = StackingClassifier(estimators = estimadores , final_estimator = KNeighborsClassifier())

parametros = {'knn__n_neighbors': [3,5,7],
              'svm__C': [0.1,1,100], 'svm__kernel': ['poly','linear'], 'svm__degree': [2,3],
              'dt__max_depth' : [5,7],'dt__min_samples_split': [2], 'dt__min_samples_leaf': [4]}

grid = GridSearchCV(estimator = sclf, param_grid = parametros, cv=5, scoring='f1', verbose=3)
grid.fit(X_train, y_train)

y_pred = grid.best_estimator_.predict(X_test)
print(f"Mejores parámetros: {grid.best_params_}")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")

Dimensiones de los conjuntos:
Conjunto de entrenamiento: (708, 10), (708,)
Conjunto de prueba: (178, 10), (178,)
Fitting 5 folds for each of 72 candidates, totalling 360 fits
[CV 1/5] END dt__max_depth=5, dt__min_samples_leaf=4, dt__min_samples_split=2, knn__n_neighbors=3, svm__C=0.1, svm__degree=2, svm__kernel=poly;, score=0.904 total time=   0.1s
[CV 2/5] END dt__max_depth=5, dt__min_samples_leaf=4, dt__min_samples_split=2, knn__n_neighbors=3, svm__C=0.1, svm__degree=2, svm__kernel=poly;, score=0.909 total time=   0.1s
[CV 3/5] END dt__max_depth=5, dt__min_samples_leaf=4, dt__min_samples_split=2, knn__n_neighbors=3, svm__C=0.1, svm__degree=2, svm__kernel=poly;, score=0.882 total time=   0.1s
[CV 4/5] END dt__max_depth=5, dt__min_samples_leaf=4, dt__min_samples_split=2, knn__n_neighbors=3, svm__C=0.1, svm__degree=2, svm__kernel=poly;, score=0.934 total time=   0.1s
[CV 5/5] END dt__max_depth=5, dt__min_samples_leaf=4, dt__min_samples_split=2, knn__n_neighbors=3, svm__C=0.1, svm__degre

### Validation on the test set

In [102]:
# Take the test data and keep only the features selected by RFC
X_test = test_data.drop(columns=['Class'])
y_test_final = test_data['Class']
columns_to_keep_RFC = ['V17', 'V16', 'V12', 'V14', 'V11', 'V10', 'V9', 'V4', 'V18', 'V7']
X_test_reduce = X_test[columns_to_keep_RFC]

print(y_test_final.value_counts())

# Predict on the test set
y_pred = grid.best_estimator_.predict(X_test_reduce)

# Compute the confusion matrix and the classification report
conf_matrix = confusion_matrix(y_test_final, y_pred)
report = classification_report(y_test_final, y_pred, target_names=['Correctas', 'Fraudulentas'])
# Show the confusion matrix
print("Matriz de confusión:")
print(conf_matrix)

print("Reporte de Clasificación:")
print(report)

Class
0    28432
1       49
Name: count, dtype: int64
Matriz de confusión:
[[27620   812]
 [    5    44]]
Reporte de Clasificación:
              precision    recall  f1-score   support

   Correctas       1.00      0.97      0.99     28432
Fraudulentas       0.05      0.90      0.10        49

    accuracy                           0.97     28481
   macro avg       0.53      0.93      0.54     28481
weighted avg       1.00      0.97      0.98     28481



### Final estimator: decision tree

In [103]:
X = df_reduce_RFC_instances_hard.drop(columns=['Class'])
y = df_reduce_RFC_instances_hard['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Dimensiones de los conjuntos:")
print(f"Conjunto de entrenamiento: {X_train.shape}, {y_train.shape}")
print(f"Conjunto de prueba: {X_test.shape}, {y_test.shape}")

estimadores = [('knn', KNeighborsClassifier()),
               ('svm', SVC(random_state = 42, class_weight = 'balanced')),
               ('dt', DecisionTreeClassifier(class_weight='balanced', random_state=42))]

sclf = StackingClassifier(estimators = estimadores , final_estimator = DecisionTreeClassifier(random_state = 42, class_weight = 'balanced'))

parametros = {'knn__n_neighbors': [3,5,7],
              'svm__C': [0.1,1,100], 'svm__kernel': ['poly','linear'], 'svm__degree': [2,3],
              'dt__max_depth' : [5,7],'dt__min_samples_split': [2], 'dt__min_samples_leaf': [4]}

grid = GridSearchCV(estimator = sclf, param_grid = parametros, cv=5, scoring='f1', verbose=3)
grid.fit(X_train, y_train)

y_pred = grid.best_estimator_.predict(X_test)
print(f"Mejores parámetros: {grid.best_params_}")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")

Dimensiones de los conjuntos:
Conjunto de entrenamiento: (708, 10), (708,)
Conjunto de prueba: (178, 10), (178,)
Fitting 5 folds for each of 72 candidates, totalling 360 fits
[CV 1/5] END dt__max_depth=5, dt__min_samples_leaf=4, dt__min_samples_split=2, knn__n_neighbors=3, svm__C=0.1, svm__degree=2, svm__kernel=poly;, score=0.819 total time=   0.1s
[CV 2/5] END dt__max_depth=5, dt__min_samples_leaf=4, dt__min_samples_split=2, knn__n_neighbors=3, svm__C=0.1, svm__degree=2, svm__kernel=poly;, score=0.871 total time=   0.1s
[CV 3/5] END dt__max_depth=5, dt__min_samples_leaf=4, dt__min_samples_split=2, knn__n_neighbors=3, svm__C=0.1, svm__degree=2, svm__kernel=poly;, score=0.857 total time=   0.1s
[CV 4/5] END dt__max_depth=5, dt__min_samples_leaf=4, dt__min_samples_split=2, knn__n_neighbors=3, svm__C=0.1, svm__degree=2, svm__kernel=poly;, score=0.899 total time=   0.1s
[CV 5/5] END dt__max_depth=5, dt__min_samples_leaf=4, dt__min_samples_split=2, knn__n_neighbors=3, svm__C=0.1, svm__degre

### Validation on the test set

In [104]:
# Take the test data and keep only the features selected by RFC
X_test = test_data.drop(columns=['Class'])
y_test_final = test_data['Class']
columns_to_keep_RFC = ['V17', 'V16', 'V12', 'V14', 'V11', 'V10', 'V9', 'V4', 'V18', 'V7']
X_test_reduce = X_test[columns_to_keep_RFC]

print(y_test_final.value_counts())

# Predict on the test set
y_pred = grid.best_estimator_.predict(X_test_reduce)

# Compute the confusion matrix and the classification report
conf_matrix = confusion_matrix(y_test_final, y_pred)
report = classification_report(y_test_final, y_pred, target_names=['Correctas', 'Fraudulentas'])
# Show the confusion matrix
print("Matriz de confusión:")
print(conf_matrix)

print("Reporte de Clasificación:")
print(report)

Class
0    28432
1       49
Name: count, dtype: int64
Matriz de confusión:
[[24046  4386]
 [    5    44]]
Reporte de Clasificación:
              precision    recall  f1-score   support

   Correctas       1.00      0.85      0.92     28432
Fraudulentas       0.01      0.90      0.02        49

    accuracy                           0.85     28481
   macro avg       0.50      0.87      0.47     28481
weighted avg       1.00      0.85      0.91     28481
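
To compare the six configurations side by side, the sketch below recomputes the fraud-class precision, recall and F1 from the confusion matrices printed in the sections above and ranks them by F1. The dictionary keys and the helper `fraud_metrics` are labels introduced here for illustration.

```python
# Confusion matrices printed above, as [[TN, FP], [FN, TP]].
# Keys name the feature selection and the final estimator of each run.
results = {
    'mRMR + LogisticRegression': [[27854, 578], [6, 43]],
    'mRMR + KNN': [[26110, 2322], [4, 45]],
    'mRMR + decision tree': [[23424, 5008], [5, 44]],
    'RFC + LogisticRegression': [[28104, 328], [5, 44]],
    'RFC + KNN': [[27620, 812], [5, 44]],
    'RFC + decision tree': [[24046, 4386], [5, 44]],
}


def fraud_metrics(matrix):
    """Return (precision, recall, f1) for the positive (fraud) class."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


summary = {name: fraud_metrics(m) for name, m in results.items()}

# Rank configurations by fraud-class F1, best first.
ranked = sorted(summary.items(), key=lambda kv: kv[1][2], reverse=True)
for name, (p, r, f1) in ranked:
    print(f"{name:<28} precision={p:.3f}  recall={r:.3f}  f1={f1:.3f}")

best = ranked[0][0]
```

Recall is close to 0.9 in every run, so the ranking is driven almost entirely by the number of false positives.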



## Conclusions

Although the classifiers are not fully optimal, the predictions improve appreciably over those of the decision trees. `SVM` and `KNN` were able to compensate to some extent, especially on the datasets filtered with `ClusterCentroidsHard`, for the decision trees' inability to recognize a non-fraudulent transaction, which reflects the power that ensembles offer. Ensembles whose own estimators are themselves ensembles could also have been built. That approach might prove more effective for building a classifier that responds adequately to the particularities of the problem.
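
As a rough illustration of that last idea, the sketch below stacks two ensembles (a random forest and gradient boosting) under a logistic-regression meta-learner. It runs on synthetic data and was not evaluated on the transactions dataset; the choice of base ensembles, their settings and all variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic toy data (an assumption; not the Kaggle transactions).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# Each base estimator is itself an ensemble.
nested = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
        ('gb', GradientBoostingClassifier(n_estimators=50, random_state=42)),
    ],
    final_estimator=LogisticRegression(),
    cv=3,  # folds used to build the meta-learner's training features
)

scores = cross_val_score(nested, X_demo, y_demo, cv=3, scoring='f1')
print(f"Mean F1 over 3 folds: {scores.mean():.2f}")
```

The same `GridSearchCV` pattern used throughout the notebook would apply here, with prefixes such as `rf__n_estimators` or `gb__learning_rate` in the parameter grid.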