## Analyzing the Data

| Field Name     | Field Description |
|----------------|-------------------|
| fyear          | Fiscal year |
| gvkey          | Firm identifier key |
| p_aaer         | Identifier used to handle serial fraud. Accounting fraud can span several consecutive reporting periods, creating so-called "serial fraud" |
| misstate       | Fraud label (1 = fraud, 0 = no fraud) |
| act            | Current assets, total |
| ap             | Accounts payable, trade |
| at             | Assets, total |
| ceq            | Common/ordinary equity, total |
| che            | Cash and short-term investments |
| cogs           | Cost of goods sold |
| csho           | Common shares outstanding |
| dlc            | Debt in current liabilities, total |
| dltis          | Long-term debt issuance |
| dltt           | Long-term debt, total |
| dp             | Depreciation and amortization |
| ib             | Income before extraordinary items |
| invt           | Inventories, total |
| ivao           | Investments and advances, other |
| ivst           | Short-term investments, total |
| lct            | Current liabilities, total |
| lt             | Liabilities, total |
| ni             | Net income (loss) |
| ppegt          | Property, plant and equipment, total |
| pstk           | Preferred/preference stock (capital), total |
| re             | Retained earnings |
| rect           | Receivables, total |
| sale           | Sales/turnover (net) |
| sstk           | Sale of common and preferred stock |
| txp            | Income taxes payable |
| txt            | Income taxes, total |
| xint           | Interest and related expense, total |
| prcc_f         | Price close, annual, fiscal |
| dch_wc         | Working capital (WC) accruals |
| ch_rsst        | RSST accruals |
| dch_rec        | Change in receivables |
| dch_inv        | Change in inventory |
| soft_assets    | % soft assets |
| ch_cs          | Change in cash sales |
| ch_cm          | Change in cash margin |
| ch_roa         | Change in return on assets |
| issue          | Actual issuance |
| bm             | Book-to-market |
| dpi            | Depreciation index |
| reoa           | Retained earnings over total assets |
| EBIT           | Earnings before interest and taxes over total assets |
| ch_fcf         | Change in free cash flows |

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Load the data
data = pd.read_csv("data_FraudDetection_JAR2020.csv")
# Show an initial random sample
print("Data sample:")
data.sample(10)

Data sample:


Unnamed: 0,fyear,gvkey,p_aaer,misstate,act,ap,at,ceq,che,cogs,...,soft_assets,ch_cs,ch_cm,ch_roa,issue,bm,dpi,reoa,EBIT,ch_fcf
127018,2011,140226,,0,0.144,0.845,1.294,0.449,0.106,0.0,...,0.029366,-2.315789,-1.832381,0.689535,1,0.109045,,-9.315301,-0.40881,3.024401
103290,2007,23100,,0,0.373,1.361,0.472,-1.74,0.002,6.603,...,0.830508,0.491948,-0.658015,1.621628,0,-0.685935,0.761472,-69.167373,-3.682203,3.9864
100499,2006,148950,,0,472.3,62.2,1406.6,403.9,161.0,588.6,...,0.767311,0.10868,0.007656,-0.166155,1,0.579612,1.018337,-0.00974,-0.174961,-0.022365
67831,2001,22008,,0,1.031,0.357,3.382,0.354,0.912,0.841,...,0.135719,5.196335,-1.225487,0.163954,0,0.01134,1.585632,-2.481076,-0.033412,0.236832
4064,1990,15711,,0,485.714,118.497,965.178,316.541,56.172,1113.921,...,0.455697,-0.104395,-0.255108,-0.018397,1,2.350743,1.336793,0.007655,0.041706,-0.114653
4733,1991,1655,,0,210.762,16.232,335.173,190.303,6.121,294.713,...,0.692884,0.068592,-0.108945,-0.004615,1,0.546556,0.991929,0.539363,0.129814,-0.030444
131680,2012,104092,,0,312.409,16.593,1570.957,736.946,185.126,119.669,...,0.524469,,,,1,0.100441,,0.187183,0.303051,
106410,2007,161993,,0,66.302,37.66,956.636,602.498,18.841,58.841,...,0.052815,0.407386,0.221776,0.016563,1,0.457185,0.757575,0.101679,0.137078,-0.058835
136294,2013,20245,,0,60.944,0.533,77.746,71.55,60.027,40.734,...,0.206107,-6.484197,-1.022162,-0.421198,1,0.575233,0.964875,-5.622617,-0.498534,-0.449309
136545,2013,25501,,0,9.9,19.573,170.156,116.932,0.0,25.88,...,0.058182,0.244365,0.23731,0.061518,1,0.857228,0.891149,-0.147823,-0.012412,0.054168
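
Because `misstate` is the target label, it can help to check how rare the positive class is before any modeling. The sketch below uses a made-up toy frame (the `toy` DataFrame and its values are illustrative assumptions, not the real data) to show the check.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({"misstate": [0] * 98 + [1] * 2})

# Absolute counts per class and the share of rows labeled as fraud
counts = toy["misstate"].value_counts()
fraud_rate = toy["misstate"].mean()

print(counts)
print(f"Fraud rate: {fraud_rate:.2%}")
```

On the real data the same two lines would be applied to `data['misstate']`.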


In [3]:
# Drop columns that are not used as features
data = data.drop(['gvkey', 'p_aaer', 'fyear'], axis='columns')
print("\nData after dropping unused columns:")
data.head(10)


Data after dropping unused columns:


Unnamed: 0,misstate,act,ap,at,ceq,che,cogs,csho,dlc,dltis,...,soft_assets,ch_cs,ch_cm,ch_roa,issue,bm,dpi,reoa,EBIT,ch_fcf
0,0,10.047,3.736,32.335,6.262,0.002,30.633,2.526,3.283,32.853,...,0.312448,0.095082,0.082631,-0.019761,1,0.41317,0.873555,0.16762,0.161961,-0.04214
1,0,1.247,0.803,7.784,0.667,0.171,1.125,3.556,0.021,2.017,...,0.315904,0.188832,-0.211389,-0.117832,1,0.157887,0.745139,-0.428957,-0.157888,0.100228
2,0,55.04,3.601,118.12,44.393,3.132,107.343,3.882,6.446,6.5,...,0.605342,0.097551,-0.10578,0.091206,1,2.231337,1.015131,0.394768,0.063681,0.066348
3,0,24.684,3.948,34.591,7.751,0.411,31.214,4.755,8.791,0.587,...,0.793068,-0.005725,-0.249704,0.017545,1,1.043582,1.026261,0.094822,0.088347,-0.017358
4,0,17.325,3.52,27.542,-12.142,1.017,32.662,6.735,32.206,0.0,...,0.869182,-0.231536,-1.674893,-0.466667,0,-1.602508,0.598443,-0.942379,-0.700821,0.130349
5,0,148.396,24.301,328.495,111.015,8.478,153.262,11.235,44.339,0.273,...,0.688689,0.040056,0.092675,0.003067,1,0.389406,0.851688,0.191741,0.105527,-0.034367
6,0,637.88,199.012,1011.901,324.132,113.271,1185.288,28.489,22.444,60.769,...,0.754448,-0.033881,-0.37244,-0.040405,1,1.379084,0.95572,-0.038167,0.055174,-0.04216
7,0,396.594,92.14,677.736,183.566,50.125,596.137,46.758,80.971,1.146,...,0.819713,0.047023,-0.061932,-0.108796,1,1.847471,0.964188,-0.192385,-0.031264,-0.041039
8,0,2657.8,966.3,13353.6,3727.4,949.3,10908.2,62.3,1319.1,2264.6,...,0.243882,0.117693,-0.369057,-0.047424,1,1.236793,0.996031,0.181389,0.022773,-0.038199
9,0,0.004,0.0,0.126,-0.778,0.0,0.0,9.928,0.624,0.132,...,0.809524,,,-1.520085,1,-0.156728,,-22.174603,-7.016393,0.118902


In [4]:
# Impute missing values using a business rule (replace missing values with zero)
columns_to_impute = ['dch_wc', 'ch_rsst', 'dch_rec', 'dch_inv', 'ch_cs', 'ch_cm', 'ch_roa', 'bm', 'reoa', 'EBIT', 'ch_fcf', 'soft_assets', 'dpi']
data[columns_to_impute] = data[columns_to_impute].fillna(0)
print("\nMissing values after zero imputation:")
data[columns_to_impute].isnull().sum()


Missing values after zero imputation:


dch_wc         0
ch_rsst        0
dch_rec        0
dch_inv        0
ch_cs          0
ch_cm          0
ch_roa         0
bm             0
reoa           0
EBIT           0
ch_fcf         0
soft_assets    0
dpi            0
dtype: int64
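
Filling with zero erases the fact that a value was missing. One possible variation, shown here as a sketch on hypothetical toy values, is to record a missing-value flag per column before filling. The `_missing` suffix and the `toy_cols` list are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values for two of the imputed columns
toy_df = pd.DataFrame({
    "ch_cs": [0.1, np.nan, -0.3],
    "dpi": [np.nan, 1.0, 0.9],
})
toy_cols = ["ch_cs", "dpi"]

# Record which entries were missing before they are overwritten
for col in toy_cols:
    toy_df[f"{col}_missing"] = toy_df[col].isna().astype(int)

# Apply the same zero-fill rule used above
toy_df[toy_cols] = toy_df[toy_cols].fillna(0)
print(toy_df)
```

Whether the extra flag columns help is an empirical question; they simply keep the information available to the model.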

In [5]:
# Save the dataset
data.to_csv('fraude1.csv', index=False)

In [6]:
# Reload the saved dataset
data1 = pd.read_csv('fraude1.csv')
print("\nNew dataset:")
data1.sample(10)


New dataset:


Unnamed: 0,misstate,act,ap,at,ceq,che,cogs,csho,dlc,dltis,...,soft_assets,ch_cs,ch_cm,ch_roa,issue,bm,dpi,reoa,EBIT,ch_fcf
52707,0,1828.4,619.1,7258.2,1127.3,10.3,4261.1,214.0,81.9,356.7,...,0.398969,-0.049879,0.017007,0.004535,1,0.192429,0.991279,-0.26716,0.12656,-0.029727
35273,0,1.885,1.204,2.549,0.682,0.027,1.214,22.297,0.286,0.086,...,0.851314,-0.292564,-1.521688,-0.008525,1,4.937457,1.305474,-0.73676,0.0153,-0.183876
69457,0,9.276,2.037,32.502,4.714,5.509,31.261,13.769,1.445,0.0,...,0.269737,0.298149,-0.045174,-0.173753,1,0.084954,0.891006,-2.530398,-0.509599,-0.183973
38896,0,2988.398,2595.255,4965.743,866.739,644.104,2618.394,162.115,17.672,248.091,...,0.822026,0.070371,-5.290689,0.002804,1,0.12617,0.929362,0.102118,0.084984,0.002468
129289,0,133.892,12.345,434.812,275.936,30.186,206.398,27.101,10.0,96.588,...,0.831633,0.245218,0.202536,0.015029,1,0.389807,0.982184,0.404115,0.118674,-0.047487
143744,0,16.668,3.374,17.193,13.819,16.185,0.0,93.512,0.0,0.0,...,0.028093,0.409836,0.480763,0.115756,1,0.230903,0.835182,-14.522538,-1.082592,0.09869
18159,0,44.876,4.578,93.569,56.992,5.775,63.47,9.971,3.187,15.0,...,0.818198,0.847563,0.470759,0.093775,1,1.678149,0.781168,0.008689,0.072171,-0.086398
116406,0,11.627,6.468,16.634,2.687,0.0,39.947,96.708,5.311,0.0,...,0.805098,0.091566,-0.682263,-0.169268,0,0.505176,0.837619,-2.421366,-0.223759,0.046998
116237,0,0.045,0.081,6.652,5.731,0.013,0.012,26.581,0.329,0.685,...,0.9316,-0.450331,-1.027729,0.011209,1,0.829251,0.916313,-1.981058,-0.225646,0.156295
29344,0,121.625,6.37,222.972,187.084,83.213,70.583,6.461,0.0,0.0,...,0.510862,0.032489,0.176771,-0.012997,1,0.851644,1.021455,0.784573,-0.061833,0.060287


In [7]:
# Prepare the data for the model: separate features and label, then split 70/30
X = data1.drop('misstate', axis=1)
y = data1['misstate']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
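
With a rare positive class, a purely random split can leave the two partitions with different fraud rates. A minimal sketch, on synthetic data, of passing `stratify` to `train_test_split` so both partitions keep roughly the same class ratio (the `*_toy` names and the 1% positive rate are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 10 of them positive
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 10 + [0] * 990)

# stratify=y_toy keeps the class proportions similar in both partitions
X_tr_toy, X_te_toy, y_tr_toy, y_te_toy = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=42, stratify=y_toy
)

print("Positives in train:", y_tr_toy.sum())
print("Positives in test:", y_te_toy.sum())
```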

In [8]:
# Scale the data with StandardScaler (fit on the training split, then applied to the test split)
std_scaler = StandardScaler()
X_train_scaled_std = std_scaler.fit_transform(X_train)
X_test_scaled_std = std_scaler.transform(X_test)

In [9]:
# Store the scaling results once the data have been scaled
import pickle

scaling_results = {
    'scaler_type': std_scaler,  # the fitted scaler object itself
    'X_train_scaled_std': X_train_scaled_std,
    'X_test_scaled_std': X_test_scaled_std
}

# Write the results to disk so they can be reloaded later
with open('scaling_results.pkl', 'wb') as file:
    pickle.dump(scaling_results, file)

In [10]:
# Train a RandomForest model on the standardized training data
clf_rnd = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
clf_rnd.fit(X_train_scaled_std, y_train)


In [11]:
# Rank the features by importance (most important first)
feature_importances = pd.Series(clf_rnd.feature_importances_, index=X_train.columns).sort_values(ascending=False)

In [12]:
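# Store the feature-reduction results: the 10 most important features,
# sliced from the unscaled X_train and X_test
import pickle

reduction_results = {
    'selected_features': list(feature_importances.head(10).index),
    'X_train_reduced': X_train[feature_importances.head(10).index].copy(),
    'X_test_reduced': X_test[feature_importances.head(10).index].copy()
}

# Write the results to disk so they can be reloaded later
with open('reduction_results.pkl', 'wb') as file:
    pickle.dump(reduction_results, file)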
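
Note that `X_train_reduced` and `X_test_reduced` above are sliced from the unscaled `X_train` and `X_test`. If standardized inputs are wanted for scale-sensitive models, one option is to fit a second scaler on the reduced columns only. The sketch below uses synthetic stand-in frames; `train_red`, `test_red` and `scaler_red` are hypothetical names, and only three of the selected columns are mimicked.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the reduced train/test frames
rng = np.random.default_rng(0)
cols = ["dch_rec", "soft_assets", "csho"]
train_red = pd.DataFrame(rng.normal(5, 3, size=(200, 3)), columns=cols)
test_red = pd.DataFrame(rng.normal(5, 3, size=(50, 3)), columns=cols)

# Fit on the training columns only, then reuse the same statistics for the test set
scaler_red = StandardScaler()
train_red_scaled = pd.DataFrame(
    scaler_red.fit_transform(train_red), columns=cols, index=train_red.index
)
test_red_scaled = pd.DataFrame(
    scaler_red.transform(test_red), columns=cols, index=test_red.index
)

print(train_red_scaled.describe().loc[["mean", "std"]])
```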

In [13]:
# Show information about the reduced feature sets
print("\nReduced feature information (training and test):")
reduction_results['X_train_reduced'].info()
reduction_results['X_test_reduced'].info()


Reduced feature information (training and test):
<class 'pandas.core.frame.DataFrame'>
Index: 102231 entries, 83313 to 121958
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   dch_rec      102231 non-null  float64
 1   soft_assets  102231 non-null  float64
 2   csho         102231 non-null  float64
 3   prcc_f       102231 non-null  float64
 4   che          102231 non-null  float64
 5   rect         102231 non-null  float64
 6   ch_cs        102231 non-null  float64
 7   dch_wc       102231 non-null  float64
 8   reoa         102231 non-null  float64
 9   bm           102231 non-null  float64
dtypes: float64(10)
memory usage: 8.6 MB
<class 'pandas.core.frame.DataFrame'>
Index: 43814 entries, 21895 to 885
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   dch_rec      43814 non-null  float64
 1   soft_assets  43814 non-n

In [14]:
import pickle

# Load the scaling results
scaling_results_path = 'scaling_results.pkl'
with open(scaling_results_path, 'rb') as file:
    scaling_results = pickle.load(file)

# Load the feature-reduction results
reduction_results_path = 'reduction_results.pkl'
with open(reduction_results_path, 'rb') as file:
    reduction_results = pickle.load(file)


## Logistic Regression

In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score

# Initialize the logistic regression model
logreg_model = LogisticRegression(random_state=42)

# Train on the reduced features (as built above, these come from the unscaled X_train)
logreg_model.fit(reduction_results['X_train_reduced'], y_train)

# Predict on the test set
y_pred = logreg_model.predict(reduction_results['X_test_reduced'])

# Compute the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Compute precision, recall and F1 score
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Compute the AUC-ROC score from the predicted fraud probabilities
y_prob = logreg_model.predict_proba(reduction_results['X_test_reduced'])[:, 1]
roc_auc = roc_auc_score(y_test, y_prob)

# Print the metrics
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1 Score: {f1}')
print(f'AUC-ROC Score: {roc_auc}')

# Show the confusion matrix
print('\nConfusion Matrix:')
print(conf_matrix)

Accuracy: 0.9934724060802483
Precision: 0.0
Recall: 0.0
F1 Score: 0.0
AUC-ROC Score: 0.4861004067131944

Confusion Matrix:
[[43528     2]
 [  284     0]]
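
The accuracy above is easier to interpret next to a trivial baseline. The following arithmetic, using only the confusion matrix printed above, compares the model with a classifier that always predicts "no fraud" (the variable names are illustrative):

```python
import numpy as np

# Confusion matrix reported above: rows = true class, columns = predicted class
cm = np.array([[43528, 2],
               [284, 0]])
tn, fp, fn, tp = cm.ravel()

# Accuracy of the fitted model
accuracy = (tn + tp) / cm.sum()

# Accuracy of a trivial model that labels every row as non-fraud
baseline_accuracy = (tn + fp) / cm.sum()

# Share of true fraud cases that were detected
fraud_recall = tp / (tp + fn)

print(f"Model accuracy:    {accuracy:.6f}")
print(f"Baseline accuracy: {baseline_accuracy:.6f}")
print(f"Fraud recall:      {fraud_recall:.6f}")
```

The baseline comes out slightly higher than the model's accuracy, so accuracy alone says little here; recall on the fraud class is the more telling figure.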


In [17]:
# Save the model to a file using pickle
with open('logistic_model_3.pkl', 'wb') as file:
    pickle.dump(logreg_model, file)

## Random Forest

In [18]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest model
rf_model = RandomForestClassifier(random_state=42)

# Train on the reduced features (as built above, these come from the unscaled X_train)
rf_model.fit(reduction_results['X_train_reduced'], y_train)

# Predict on the test set
y_pred_rf = rf_model.predict(reduction_results['X_test_reduced'])

# Compute the confusion matrix
conf_matrix_rf = confusion_matrix(y_test, y_pred_rf)

# Compute precision, recall and F1 score
precision_rf = precision_score(y_test, y_pred_rf)
recall_rf = recall_score(y_test, y_pred_rf)
f1_rf = f1_score(y_test, y_pred_rf)

# Compute the AUC-ROC score from the predicted fraud probabilities
y_prob_rf = rf_model.predict_proba(reduction_results['X_test_reduced'])[:, 1]
roc_auc_rf = roc_auc_score(y_test, y_prob_rf)

# Print the metrics
print(f'Accuracy: {accuracy_score(y_test, y_pred_rf)}')
print(f'Precision: {precision_rf}')
print(f'Recall: {recall_rf}')
print(f'F1 Score: {f1_rf}')
print(f'AUC-ROC Score: {roc_auc_rf}')

# Show the confusion matrix
print('\nConfusion Matrix:')
print(conf_matrix_rf)


Accuracy: 0.9935180535901766
Precision: 0.0
Recall: 0.0
F1 Score: 0.0
AUC-ROC Score: 0.7172249266330812

Confusion Matrix:
[[43530     0]
 [  284     0]]
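
`predict` applies a 0.5 probability cut-off, which with a rare positive class may never be reached even when the ranking (AUC-ROC) carries some signal. A sketch, on synthetic imbalanced data from `make_classification` (not the fraud dataset), of applying lower cut-offs to `predict_proba`. The thresholds and all `*_syn` names are arbitrary illustrations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 3% positives
X_syn, y_syn = make_classification(
    n_samples=4000, n_features=10, weights=[0.97, 0.03], random_state=42
)
X_tr_syn, X_te_syn, y_tr_syn, y_te_syn = train_test_split(
    X_syn, y_syn, test_size=0.3, stratify=y_syn, random_state=42
)

rf_syn = RandomForestClassifier(n_estimators=50, random_state=42)
rf_syn.fit(X_tr_syn, y_tr_syn)
proba_syn = rf_syn.predict_proba(X_te_syn)[:, 1]

# Recall at several cut-offs; a lower cut-off can only flag more rows
recalls = {}
for threshold in (0.5, 0.2, 0.1):
    y_hat = (proba_syn >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te_syn, y_hat)
    print(f"threshold={threshold}: recall={recalls[threshold]:.3f}, flagged={y_hat.sum()}")
```

Lowering the cut-off trades precision for recall, so the choice would normally be made on a validation set rather than on the test set.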


In [19]:
# Save the model to a file using pickle
with open('rf_model.pkl', 'wb') as file:
    pickle.dump(rf_model, file)

## Support Vector Machine

In [20]:
from sklearn.svm import SVC

# Initialize the Support Vector Machine model
svm_model = SVC(random_state=42, probability=True)  # probability=True enables predict_proba

# Train on the reduced features (as built above, these come from the unscaled X_train)
svm_model.fit(reduction_results['X_train_reduced'], y_train)

# Predict on the test set
y_pred_svm = svm_model.predict(reduction_results['X_test_reduced'])

# Compute the confusion matrix
conf_matrix_svm = confusion_matrix(y_test, y_pred_svm)

# Compute precision, recall and F1 score
precision_svm = precision_score(y_test, y_pred_svm)
recall_svm = recall_score(y_test, y_pred_svm)
f1_svm = f1_score(y_test, y_pred_svm)

# Compute the AUC-ROC score from the predicted fraud probabilities
y_prob_svm = svm_model.predict_proba(reduction_results['X_test_reduced'])[:, 1]
roc_auc_svm = roc_auc_score(y_test, y_prob_svm)

# Print the metrics
print(f'Accuracy: {accuracy_score(y_test, y_pred_svm)}')
print(f'Precision: {precision_svm}')
print(f'Recall: {recall_svm}')
print(f'F1 Score: {f1_svm}')
print(f'AUC-ROC Score: {roc_auc_svm}')

# Show the confusion matrix
print('\nConfusion Matrix:')
print(conf_matrix_svm)


Accuracy: 0.9935180535901766
Precision: 0.0
Recall: 0.0
F1 Score: 0.0
AUC-ROC Score: 0.4443593215622705

Confusion Matrix:
[[43530     0]
 [  284     0]]
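
For rare-event problems, average precision (a summary of the precision-recall curve) is often reported alongside AUC-ROC, because its no-skill level is roughly the positive rate rather than 0.5. A small sketch with hand-made labels and scores (all values are hypothetical and unrelated to the results above):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Hypothetical labels (2 positives out of 10) and model scores
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
scores_toy = np.array([0.10, 0.20, 0.15, 0.30, 0.05, 0.40, 0.35, 0.60, 0.70, 0.25])

ap_toy = average_precision_score(y_true_toy, scores_toy)
auc_toy = roc_auc_score(y_true_toy, scores_toy)
prevalence_toy = y_true_toy.mean()

print(f"Average precision: {ap_toy:.3f}")
print(f"AUC-ROC:           {auc_toy:.3f}")
print(f"Positive rate:     {prevalence_toy:.3f}")
```

With the models in this notebook, the same call would take `y_test` and the `predict_proba` column already computed for AUC-ROC.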


In [21]:
# Save the model to a file using pickle
with open('svm_model.pkl', 'wb') as file:
    pickle.dump(svm_model, file)

## Isolation Forest

In [20]:
from sklearn.ensemble import IsolationForest
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Initialize the Isolation Forest model
isolation_forest_model = IsolationForest(random_state=42, contamination='auto')

# Fit on the reduced features (as built above, these come from the unscaled X_train); no labels are used
isolation_forest_model.fit(reduction_results['X_train_reduced'])

# Predict on the test set
y_pred_iforest = isolation_forest_model.predict(reduction_results['X_test_reduced'])

# Convert the predictions (-1 = anomaly, 1 = normal) to labels (1 = anomaly/fraud, 0 = normal)
y_pred_iforest[y_pred_iforest == 1] = 0
y_pred_iforest[y_pred_iforest == -1] = 1

# Compute the confusion matrix
conf_matrix_iforest = confusion_matrix(y_test, y_pred_iforest)

# Compute precision, recall and F1 score
precision_iforest = precision_score(y_test, y_pred_iforest)
recall_iforest = recall_score(y_test, y_pred_iforest)
f1_iforest = f1_score(y_test, y_pred_iforest)

# Print the metrics
print(f'Accuracy: {accuracy_score(y_test, y_pred_iforest)}')
print(f'Precision: {precision_iforest}')
print(f'Recall: {recall_iforest}')
print(f'F1 Score: {f1_iforest}')

# Show the confusion matrix
print('\nConfusion Matrix:')
print(conf_matrix_iforest)


Accuracy: 0.9370520838088282
Precision: 0.009126984126984128
Recall: 0.08098591549295775
F1 Score: 0.016405135520684736

Confusion Matrix:
[[41033  2497]
 [  261    23]]
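
No AUC-ROC is reported for this model, but a ranking metric does not require probabilities: any score that orders rows from normal to abnormal will do. A minimal sketch, on synthetic two-dimensional data (not the fraud dataset), using `score_samples`; all variable names are illustrative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

# Synthetic data: a dense normal cluster plus a few far-away outliers
rng = np.random.default_rng(42)
X_norm_toy = rng.normal(0, 1, size=(500, 2))
X_out_toy = rng.normal(6, 1, size=(20, 2))
X_all_toy = np.vstack([X_norm_toy, X_out_toy])
y_all_toy = np.array([0] * 500 + [1] * 20)

iso_toy = IsolationForest(random_state=42).fit(X_all_toy)

# score_samples returns lower values for more abnormal rows,
# so negate it to get a score where higher means more anomalous
anomaly_score = -iso_toy.score_samples(X_all_toy)
auc_iso_toy = roc_auc_score(y_all_toy, anomaly_score)

print(f"AUC-ROC from anomaly scores: {auc_iso_toy:.3f}")
```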


In [21]:
# Save the model to a file using pickle
with open('if_model.pkl', 'wb') as file:
    pickle.dump(isolation_forest_model, file)

## Local Outlier Factor

In [16]:
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Initialize the Local Outlier Factor model
lof_model = LocalOutlierFactor(n_neighbors=20, contamination='auto')

# Fit on the reduced training features (as built above, these come from the unscaled X_train)
lof_model.fit(reduction_results['X_train_reduced'])

# Label the test set with fit_predict. Note that fit_predict refits the model on the
# test data, so the training fit above does not influence these predictions
y_pred_lof = lof_model.fit_predict(reduction_results['X_test_reduced'])

# Convert the predictions (-1 = anomaly, 1 = normal) to labels (1 = anomaly/fraud, 0 = normal)
y_pred_lof[y_pred_lof == 1] = 0
y_pred_lof[y_pred_lof == -1] = 1

# Compute the confusion matrix
conf_matrix_lof = confusion_matrix(y_test, y_pred_lof)

# Compute precision, recall and F1 score
precision_lof = precision_score(y_test, y_pred_lof)
recall_lof = recall_score(y_test, y_pred_lof)
f1_lof = f1_score(y_test, y_pred_lof)

# AUC-ROC is not computed here: Local Outlier Factor does not provide probabilities
# (a ranking metric would need its outlier scores instead)

# Print the metrics
print(f'Accuracy: {accuracy_score(y_test, y_pred_lof)}')
print(f'Precision: {precision_lof}')
print(f'Recall: {recall_lof}')
print(f'F1 Score: {f1_lof}')

# Show the confusion matrix
print('\nConfusion Matrix:')
print(conf_matrix_lof)



Accuracy: 0.9727027890628567
Precision: 0.004347826086956522
Recall: 0.014084507042253521
F1 Score: 0.006644518272425249

Confusion Matrix:
[[42614   916]
 [  280     4]]
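
In scikit-learn, `LocalOutlierFactor` only exposes `predict` for unseen data when it is created with `novelty=True`; in the default mode, `fit_predict` scores the same data it is fitted on. A sketch of the novelty mode on synthetic data, as one possible way to fit on a training set and label a separate test set (the `*_toy` names and the two planted outliers are assumptions):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)

# Training data: a single normal cluster
X_train_toy = rng.normal(0, 1, size=(300, 2))

# Test data: 50 similar points plus two planted, far-away points at the end
X_test_toy = np.vstack([
    rng.normal(0, 1, size=(50, 2)),
    np.array([[8.0, 8.0], [-9.0, 7.0]]),
])

# novelty=True enables predict() on data not seen during fit
lof_novelty = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof_novelty.fit(X_train_toy)

# -1 = anomaly, 1 = normal; map to 1 = anomaly, 0 = normal
raw_pred_toy = lof_novelty.predict(X_test_toy)
y_pred_toy = np.where(raw_pred_toy == -1, 1, 0)

print("Flagged rows:", np.flatnonzero(y_pred_toy))
```

Using `np.where` also avoids the two-step in-place relabeling, where the order of the assignments matters.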


In [17]:
# Save the model to a file using pickle
with open('lof_model.pkl', 'wb') as file:
    pickle.dump(lof_model, file)

## One Class SVM

In [19]:
from sklearn.svm import OneClassSVM
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Initialize the One-Class SVM model
ocsvm_model = OneClassSVM(gamma='auto', nu=0.05)  # Tune these parameters as needed

# Train on the reduced features (unscaled, as built above), using only the normal (non-fraud) rows
ocsvm_model.fit(reduction_results['X_train_reduced'][y_train == 0])

# Predict on the test set
y_pred_ocsvm = ocsvm_model.predict(reduction_results['X_test_reduced'])

# Convert the predictions (-1 = anomaly, 1 = normal) to labels (1 = anomaly/fraud, 0 = normal)
y_pred_ocsvm[y_pred_ocsvm == 1] = 0
y_pred_ocsvm[y_pred_ocsvm == -1] = 1

# Compute the confusion matrix
conf_matrix_ocsvm = confusion_matrix(y_test, y_pred_ocsvm)

# Compute precision, recall and F1 score
precision_ocsvm = precision_score(y_test, y_pred_ocsvm)
recall_ocsvm = recall_score(y_test, y_pred_ocsvm)
f1_ocsvm = f1_score(y_test, y_pred_ocsvm)

# AUC-ROC is not computed here: One-Class SVM does not provide probabilities
# (a ranking metric would need its decision_function scores instead)

# Print the metrics
print(f'Accuracy: {accuracy_score(y_test, y_pred_ocsvm)}')
print(f'Precision: {precision_ocsvm}')
print(f'Recall: {recall_ocsvm}')
print(f'F1 Score: {f1_ocsvm}')

# Show the confusion matrix
print('\nConfusion Matrix:')
print(conf_matrix_ocsvm)


Accuracy: 0.3833021408682156
Precision: 0.00824014125956445
Recall: 0.7887323943661971
F1 Score: 0.016309887869520895

Confusion Matrix:
[[16570 26960]
 [   60   224]]
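
In scikit-learn, `nu` is an upper bound on the fraction of training errors, so with `nu=0.05` one might expect only a small share of normal rows to be flagged. The confusion matrix above flags 27,184 of 43,814 test rows (about 62%). One possible contributor is that the RBF kernel is sensitive to feature scale. The sketch below, on synthetic "normal" rows with deliberately mismatched scales, wraps the model in a pipeline with `StandardScaler`; the data and the `ocsvm_pipe` name are assumptions for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)

# Synthetic normal rows whose two features differ in scale by a factor of 1000
feature_scales = np.array([1.0, 1000.0])
X_normal_train = rng.normal(0, 1, size=(1000, 2)) * feature_scales
X_normal_test = rng.normal(0, 1, size=(500, 2)) * feature_scales

# Standardize first, then fit the one-class model on normal rows only
ocsvm_pipe = make_pipeline(StandardScaler(), OneClassSVM(gamma='auto', nu=0.05))
ocsvm_pipe.fit(X_normal_train)

# Share of held-out normal rows that the model flags as anomalies (-1)
flagged_rate = np.mean(ocsvm_pipe.predict(X_normal_test) == -1)
print(f"Share of held-out normal rows flagged: {flagged_rate:.3f}")
```

Because the scaler lives inside the pipeline, pickling `ocsvm_pipe` would store the preprocessing together with the model.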


In [20]:
# Save the model to a file using pickle
with open('ocsvm_model.pkl', 'wb') as file:
    pickle.dump(ocsvm_model, file)