# Naive Bayes Classification

First, we import the required libraries.

For the Naive Bayes model we chose GaussianNB: it is the variant designed for continuous features that are approximately normally distributed, and it is fairly robust, often giving good results even when that assumption is not fully met. It was preferred over the other variants for the following reasons:
- MultinomialNB is designed for discrete data such as counts.
- BernoulliNB is specific to binary data (0 or 1), such as the presence or absence of a feature.
- CategoricalNB is specific to purely categorical variables.
- ComplementNB is ruled out for the same reason as MultinomialNB.
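
To make the choice concrete, here is a minimal from-scratch sketch of the rule a Gaussian Naive Bayes classifier applies: each class gets a per-feature mean and variance, and a sample is assigned to the class with the highest log prior plus summed Gaussian log-likelihoods. The function names and toy data are illustrative assumptions, not part of scikit-learn, and the fixed variance floor is a simplification of the library's smoothing.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate per-class priors, feature means and feature variances."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)

    params = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Population variance per feature, floored to avoid division by zero
        variances = [
            max(sum((v - m) ** 2 for v in col) / n, var_floor)
            for col, m in zip(zip(*rows), means)
        ]
        params[label] = (n / len(X), means, variances)
    return params


def predict_gaussian_nb(params, row):
    """Return the class with the highest log posterior (up to a constant)."""
    def log_posterior(prior, means, variances):
        score = math.log(prior)
        for x, m, v in zip(row, means, variances):
            score += -0.5 * math.log(2.0 * math.pi * v) - (x - m) ** 2 / (2.0 * v)
        return score

    return max(params, key=lambda label: log_posterior(*params[label]))


# Toy data (hypothetical): two continuous features, two classes
X_toy = [[1.0, 2.1], [1.2, 1.9], [0.9, 2.0], [3.0, 0.5], [3.2, 0.4], [2.9, 0.6]]
y_toy = [0, 0, 0, 1, 1, 1]

params = fit_gaussian_nb(X_toy, y_toy)
print(predict_gaussian_nb(params, [1.1, 2.0]))
print(predict_gaussian_nb(params, [3.1, 0.5]))
```

The independence assumption shows up in the inner loop: each feature contributes its own one-dimensional Gaussian term, and the terms are simply added.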

In [14]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import confusion_matrix, classification_report
import matplotlib.pyplot as plt

## Data loading

In [22]:
data_path = './data/'

train_data = pd.read_csv(f'{data_path}train_data.csv')
test_data = pd.read_csv(f'{data_path}test_data.csv')

df_reduce_mrmr = pd.read_csv(f'{data_path}X_train_reduce_mrmr.csv')
df_reduce_mrmr_instances = pd.read_csv(f'{data_path}df_reduce_mrmr_instances.csv')
df_reduce_mrmr_instances_hard = pd.read_csv(f'{data_path}df_reduce_mrmr_instances_hard.csv')
df_reduce_mrmr_instances_GLVQ = pd.read_csv(f'{data_path}df_reduce_mrmr_instances_GLVQ.csv')

df_X_train_reduce_RFC = pd.read_csv(f'{data_path}df_X_train_reduce_RFC.csv')
df_reduce_RFC_instances = pd.read_csv(f'{data_path}df_reduce_RFC_instances.csv')
df_reduce_RFC_instances_hard = pd.read_csv(f'{data_path}df_reduce_RFC_instances_hard.csv')
df_reduce_RFC_instances_GLVQ = pd.read_csv(f'{data_path}df_reduce_RFC_instances_GLVQ.csv')

print("Datos cargados exitosamente:")
print(f"train_data: {train_data.shape}")
print(train_data["Class"].value_counts())
print(f"df_reduce_mrmr: {df_reduce_mrmr.shape}")
print(df_reduce_mrmr["Class"].value_counts())
print(f"df_reduce_mrmr_instances: {df_reduce_mrmr_instances.shape}")
print(df_reduce_mrmr_instances["Class"].value_counts())
print(f"df_reduce_mrmr_instances hard: {df_reduce_mrmr_instances_hard.shape}")
print(df_reduce_mrmr_instances_hard["Class"].value_counts())
print(f"df_reduce_mrmr_instances_GLVQ: {df_reduce_mrmr_instances_GLVQ.shape}")
print(df_reduce_mrmr_instances_GLVQ["Class"].value_counts())
print(f"df_X_train_reduce_RFC: {df_X_train_reduce_RFC.shape}")
print(df_X_train_reduce_RFC["Class"].value_counts())
print(f"df_reduce_RFC_instances: {df_reduce_RFC_instances.shape}")
print(df_reduce_RFC_instances["Class"].value_counts())
print(f"df_reduce_RFC_instances hard: {df_reduce_RFC_instances_hard.shape}")
print(df_reduce_RFC_instances_hard["Class"].value_counts())
print(f"df_reduce_RFC_instances_GLVQ: {df_reduce_RFC_instances_GLVQ.shape}")
print(df_reduce_RFC_instances_GLVQ["Class"].value_counts())

Datos cargados exitosamente:
train_data: (256326, 31)
Class
0    255883
1       443
Name: count, dtype: int64
df_reduce_mrmr: (256326, 11)
Class
0    255883
1       443
Name: count, dtype: int64
df_reduce_mrmr_instances: (886, 11)
Class
0    443
1    443
Name: count, dtype: int64
df_reduce_mrmr_instances hard: (886, 11)
Class
0    443
1    443
Name: count, dtype: int64
df_reduce_mrmr_instances_GLVQ: (2, 11)
Class
0    1
1    1
Name: count, dtype: int64
df_X_train_reduce_RFC: (256326, 11)
Class
0    255883
1       443
Name: count, dtype: int64
df_reduce_RFC_instances: (886, 11)
Class
0    443
1    443
Name: count, dtype: int64
df_reduce_RFC_instances hard: (886, 11)
Class
0    443
1    443
Name: count, dtype: int64
df_reduce_RFC_instances_GLVQ: (2, 11)
Class
0    1
1    1
Name: count, dtype: int64
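
The counts above show how imbalanced the original training set is. As a quick worked calculation that uses only the numbers printed above, the fraud rate and the accuracy of a trivial classifier that always predicts class 0 can be computed as follows; the variable names are illustrative.

```python
# Class counts taken from the train_data output above
n_legit = 255883
n_fraud = 443
n_total = n_legit + n_fraud

fraud_rate = n_fraud / n_total
# Accuracy of a trivial classifier that labels every transaction as class 0
majority_baseline_accuracy = n_legit / n_total

print(f"Total rows: {n_total}")
print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Majority-class baseline accuracy: {majority_baseline_accuracy:.4f}")
```

This is why accuracy alone says little on this data, and why the instance-reduced subsets keep 443 rows per class.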


## Function to train and evaluate Naive Bayes

In [29]:
def train_and_evaluate_naive_bayes(X, y, test_data, columns_to_keep):
    """Train GaussianNB on a hold-out split of (X, y) and evaluate it on test_data."""
    # Split the reduced dataset into training and hold-out subsets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    print("Dimensiones de los conjuntos:")
    print(f"Conjunto de entrenamiento: {X_train.shape}, {y_train.shape}")
    print(f"Conjunto de prueba: {X_test.shape}, {y_test.shape}")

    # Train the model
    model = GaussianNB()
    model.fit(X_train, y_train)

    # Evaluate on the hold-out subset (model.score returns accuracy, not precision)
    accuracy = model.score(X_test, y_test)
    print(f"Precisión en el conjunto de prueba: {accuracy:.2f}")

    # Prepare the final test data; .copy() avoids pandas' SettingWithCopyWarning
    X_test_final = test_data[columns_to_keep].copy()
    y_test_final = test_data['Class']

    # Scale 'Amount' and 'Time' to [0, 1] when present.
    # Note: the scaler is fitted on the test data itself, as in the original run.
    scaler = MinMaxScaler()
    for col in ['Amount', 'Time']:
        if col in X_test_final.columns:
            X_test_final[col] = scaler.fit_transform(X_test_final[[col]]).ravel()

    print(y_test_final.value_counts())

    # Predictions on the final test set
    y_pred = model.predict(X_test_final)

    # Confusion matrix and classification report
    conf_matrix = confusion_matrix(y_test_final, y_pred)
    report = classification_report(y_test_final, y_pred, target_names=['Correctas', 'Fraudulentas'])

    print("Matriz de confusión:")
    print(conf_matrix)
    print("\nReporte de Clasificación:")
    print(report)
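
The function above fits the MinMaxScaler on the final test set itself. Whether that matches the scaling applied to the reduced training sets is not shown in this notebook, so it is worth keeping in mind that min-max scaling depends on the data it is fitted on. The pure-Python sketch below, with hypothetical values and helper names, illustrates how the same raw amount maps to different scaled values depending on which set supplies the minimum and maximum.

```python
def fit_min_max(values):
    """Return the (min, max) pair that defines a min-max scaling."""
    return min(values), max(values)


def apply_min_max(values, bounds):
    """Scale values with previously fitted bounds: (x - min) / (max - min)."""
    low, high = bounds
    return [(v - low) / (high - low) for v in values]


# Hypothetical raw 'Amount' values for a training set and a test set
train_amounts = [0.0, 50.0, 200.0, 1000.0]
test_amounts = [10.0, 50.0, 500.0]

# Option 1: bounds learned on the training data and reused on the test data
train_bounds = fit_min_max(train_amounts)
scaled_with_train = apply_min_max(test_amounts, train_bounds)

# Option 2: bounds learned on the test data itself (what fit_transform on test does)
test_bounds = fit_min_max(test_amounts)
scaled_with_test = apply_min_max(test_amounts, test_bounds)

print(scaled_with_train)  # 50.0 maps to 0.05
print(scaled_with_test)   # 50.0 maps to about 0.0816
```

If the two sets are scaled with different bounds, a model that relies on 'Amount' or 'Time' sees shifted feature values at prediction time, which may be one factor to check when results differ between feature subsets.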

## Example with mRMR (ClusterCentroids_soft)

In [30]:
X = df_reduce_mrmr_instances.drop(columns=['Class'])
y = df_reduce_mrmr_instances['Class']
columns_to_keep_mrmr = ['V17', 'Time', 'Amount', 'V25', 'V20', 'V7', 'V13', 'V22', 'V19', 'V23']

print("\n--- Evaluación con mRMR ClusterCentroids_soft ---")
train_and_evaluate_naive_bayes(X, y, test_data, columns_to_keep_mrmr)


--- Evaluación con mRMR ClusterCentroids_soft ---
Dimensiones de los conjuntos:
Conjunto de entrenamiento: (708, 10), (708,)
Conjunto de prueba: (178, 10), (178,)
Precisión en el conjunto de prueba: 0.60
Class
0    28432
1       49
Name: count, dtype: int64
Matriz de confusión:
[[ 2197 26235]
 [    2    47]]

Reporte de Clasificación:
              precision    recall  f1-score   support

   Correctas       1.00      0.08      0.14     28432
Fraudulentas       0.00      0.96      0.00        49

    accuracy                           0.08     28481
   macro avg       0.50      0.52      0.07     28481
weighted avg       1.00      0.08      0.14     28481
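
The report rounds the fraud precision to 0.00, which hides how the numbers come about. As a worked check, the metrics can be recomputed from the confusion matrix printed above; the helper name below is illustrative.

```python
def binary_metrics(conf_matrix):
    """Compute accuracy plus precision/recall/F1 for the positive class.

    conf_matrix follows scikit-learn's layout: rows are true labels,
    columns are predicted labels, i.e. [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = conf_matrix
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Confusion matrix from the mRMR ClusterCentroids_soft run above
metrics = binary_metrics([[2197, 26235], [2, 47]])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Almost every legitimate transaction is flagged as fraud (26235 false positives), so the high fraud recall comes at the cost of near-zero precision.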





## Example with RFC (ClusterCentroids_soft)

In [18]:
X = df_reduce_RFC_instances.drop(columns=['Class'])
y = df_reduce_RFC_instances['Class']
columns_to_keep_RFC = ['V17', 'V16', 'V12', 'V14', 'V11', 'V10', 'V9', 'V4', 'V18', 'V7']

print("\n--- Evaluación con RFC ClusterCentroids_soft ---")
train_and_evaluate_naive_bayes(X, y, test_data, columns_to_keep_RFC)


--- Evaluación con RFC ClusterCentroids_soft ---
Dimensiones de los conjuntos:
Conjunto de entrenamiento: (708, 10), (708,)
Conjunto de prueba: (178, 10), (178,)
Precisión en el conjunto de prueba: 0.89
Class
0    28432
1       49
Name: count, dtype: int64
Matriz de confusión:
[[28362    70]
 [    7    42]]

Reporte de Clasificación:
              precision    recall  f1-score   support

   Correctas       1.00      1.00      1.00     28432
Fraudulentas       0.38      0.86      0.52        49

    accuracy                           1.00     28481
   macro avg       0.69      0.93      0.76     28481
weighted avg       1.00      1.00      1.00     28481
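
Overall accuracy rounds to 1.00 here largely because class 0 dominates the test set. One way to see per-class behaviour is balanced accuracy, the mean of the per-class recalls, which corresponds to the macro-average recall in the report. A short sketch using the confusion matrix above (the function name is illustrative):

```python
def balanced_accuracy(conf_matrix):
    """Mean of per-class recalls; rows are true labels, columns are predictions."""
    recalls = []
    for class_index, row in enumerate(conf_matrix):
        support = sum(row)
        recalls.append(row[class_index] / support)
    return sum(recalls) / len(recalls)


# Confusion matrix from the RFC ClusterCentroids_soft run above
rfc_soft = [[28362, 70], [7, 42]]

overall_accuracy = (rfc_soft[0][0] + rfc_soft[1][1]) / sum(map(sum, rfc_soft))
print(f"Overall accuracy:  {overall_accuracy:.4f}")
print(f"Balanced accuracy: {balanced_accuracy(rfc_soft):.4f}")
```

The gap between the two values reflects the 7 missed frauds out of 49, which barely move overall accuracy but weigh heavily on the fraud-class recall.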



## Example with mRMR (GLVQ)

In [19]:
X = df_reduce_mrmr_instances_GLVQ.drop(columns=['Class'])
y = df_reduce_mrmr_instances_GLVQ['Class']

print("\n--- Evaluación con mRMR GLVQ ---")
train_and_evaluate_naive_bayes(X, y, test_data, columns_to_keep_mrmr)


--- Evaluación con mRMR GLVQ ---
Dimensiones de los conjuntos:
Conjunto de entrenamiento: (1, 10), (1,)
Conjunto de prueba: (1, 10), (1,)


Precisión en el conjunto de prueba: 0.00
Class
0    28432
1       49
Name: count, dtype: int64
Matriz de confusión:
[[28432     0]
 [   49     0]]

Reporte de Clasificación:
              precision    recall  f1-score   support

   Correctas       1.00      1.00      1.00     28432
Fraudulentas       0.00      0.00      0.00        49

    accuracy                           1.00     28481
   macro avg       0.50      0.50      0.50     28481
weighted avg       1.00      1.00      1.00     28481
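
The GLVQ result is degenerate: the reduced dataset has only two rows, so an 80/20 split leaves a single training sample. A model fitted on one sample sees one class and zero variance for every feature, which is consistent with every test transaction being predicted as class 0. The sketch below reproduces the arithmetic with the standard library; the split-size rule (test share rounded up) is stated as an assumption about how train_test_split handles fractional sizes, and the feature value is hypothetical.

```python
import math
import statistics

n_rows = 2          # rows in the GLVQ-reduced dataset (from the shapes above)
test_size = 0.2

# Assumed split rule: the test share is rounded up, the rest is used for training
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test
print(f"train rows: {n_train}, test rows: {n_test}")

# With a single training sample, the per-feature variance is zero
single_sample_feature = [0.37]  # hypothetical feature value
variance = statistics.pvariance(single_sample_feature)
print(f"variance: {variance}")

# The Gaussian log-likelihood contains log(2 * pi * variance), undefined at zero
try:
    math.log(2.0 * math.pi * variance)
except ValueError as error:
    print(f"log-likelihood undefined: {error}")
```

Under the same assumed rule, an 886-row dataset gives 178 test rows and 708 training rows, matching the shapes printed for the ClusterCentroids runs.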




## Example with RFC (GLVQ)

In [20]:
X = df_reduce_RFC_instances_GLVQ.drop(columns=['Class'])
y = df_reduce_RFC_instances_GLVQ['Class']

print("\n--- Evaluación con RFC GLVQ ---")
train_and_evaluate_naive_bayes(X, y, test_data, columns_to_keep_RFC)


--- Evaluación con RFC GLVQ ---
Dimensiones de los conjuntos:
Conjunto de entrenamiento: (1, 10), (1,)
Conjunto de prueba: (1, 10), (1,)
Precisión en el conjunto de prueba: 0.00
Class
0    28432
1       49
Name: count, dtype: int64
Matriz de confusión:
[[28432     0]
 [   49     0]]

Reporte de Clasificación:
              precision    recall  f1-score   support

   Correctas       1.00      1.00      1.00     28432
Fraudulentas       0.00      0.00      0.00        49

    accuracy                           1.00     28481
   macro avg       0.50      0.50      0.50     28481
weighted avg       1.00      1.00      1.00     28481





In [26]:
def evaluate_all_datasets():
    """Run the Naive Bayes evaluation on every reduced dataset."""
    # Feature subsets chosen by each feature-selection method
    mrmr_columns = ['V17', 'Time', 'Amount', 'V25', 'V20', 'V7', 'V13', 'V22', 'V19', 'V23']
    rfc_columns = ['V17', 'V16', 'V12', 'V14', 'V11', 'V10', 'V9', 'V4', 'V18', 'V7']

    # (dataset, columns to keep, label) for each instance-reduction strategy
    datasets = [
        (df_reduce_mrmr_instances, mrmr_columns, "mRMR ClusterCentroids_soft"),
        (df_reduce_RFC_instances, rfc_columns, "RFC ClusterCentroids_soft"),
        (df_reduce_mrmr_instances_hard, mrmr_columns, "mRMR ClusterCentroids_hard"),
        (df_reduce_RFC_instances_hard, rfc_columns, "RFC ClusterCentroids_hard"),
        (df_reduce_mrmr_instances_GLVQ, mrmr_columns, "mRMR GLVQ"),
        (df_reduce_RFC_instances_GLVQ, rfc_columns, "RFC GLVQ"),
    ]

    for dataset, columns_to_keep, name in datasets:
        print(f"\n--- Evaluación con {name} ---")
        X = dataset.drop(columns=['Class'])
        y = dataset['Class']
        train_and_evaluate_naive_bayes(X, y, test_data, columns_to_keep)

# Run the evaluation
evaluate_all_datasets()


--- Evaluación con mRMR ClusterCentroids_soft ---
Dimensiones de los conjuntos:
Conjunto de entrenamiento: (708, 10), (708,)
Conjunto de prueba: (178, 10), (178,)
Precisión en el conjunto de prueba: 0.60
Class
0    28432
1       49
Name: count, dtype: int64
Matriz de confusión:
[[ 2197 26235]
 [    2    47]]

Reporte de Clasificación:
              precision    recall  f1-score   support

   Correctas       1.00      0.08      0.14     28432
Fraudulentas       0.00      0.96      0.00        49

    accuracy                           0.08     28481
   macro avg       0.50      0.52      0.07     28481
weighted avg       1.00      0.08      0.14     28481


--- Evaluación con RFC ClusterCentroids_soft ---
Dimensiones de los conjuntos:
Conjunto de entrenamiento: (708, 10), (708,)
Conjunto de prueba: (178, 10), (178,)
Precisión en el conjunto de prueba: 0.89
Class
0    28432
1       49
Name: count, dtype: int64
Matriz de confusión:
[[28362    70]
 [    7    42]]

Reporte de Clasificación
