# Fraud Detection - Advanced Implementation
## Senior Data Scientist Analysis

This notebook implements a fraud detection workflow that addresses the weaknesses of the original model and covers the full machine learning process: preprocessing, class balancing, model selection, and evaluation.

## Phase 1: Constructive Critique of the Original Model

### Problems Identified:
1. **Inadequate handling of missing values**: Replacing them with 0 introduces bias
2. **Incorrect encoding of categorical variables**: Arbitrary integer codes imply an order that does not exist
3. **Insufficient metrics**: Only accuracy, ignoring precision and recall
4. **Untreated class imbalance**: 63.5% vs 36.5%, with no resampling such as SMOTE
5. **Missing preprocessing**: No scaling or feature engineering
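
To make item 3 concrete, here is a minimal sketch, assuming a synthetic label vector with the 63.5% / 36.5% split quoted above (the 1,000-row size is an arbitrary choice for illustration). A baseline that always predicts "not fraud" scores reasonably on accuracy while detecting no fraud at all.

```python
import numpy as np

# Synthetic labels mirroring the 63.5% / 36.5% split (sample size is an assumption).
y_true = np.array([0] * 635 + [1] * 365)
# Trivial baseline: always predict the majority class (not fraud).
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
true_positives = int(((y_pred == 1) & (y_true == 1)).sum())
recall = true_positives / int((y_true == 1).sum())

print(f"Accuracy: {accuracy:.3f}")  # 0.635
print(f"Recall:   {recall:.3f}")    # 0.000
```

Accuracy alone would rate this baseline at 63.5%, which is why the later phases also report precision, recall, F1, and ROC-AUC.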


In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from sklearn.metrics import (classification_report, confusion_matrix, 
                           precision_score, recall_score, f1_score, 
                           roc_auc_score, roc_curve, accuracy_score)
from imblearn.over_sampling import SMOTE
import xgboost as xgb
import warnings
warnings.filterwarnings('ignore')

# Plot configuration
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
%matplotlib inline

## Phase 2: Data Loading and Exploratory Analysis

In [None]:
# Load datasets
df_train = pd.read_excel('entrenamiento_fraude.xlsx')
df_test = pd.read_excel('testeo_fraude.xlsx')
df_eval = pd.read_excel('base_evaluada.xlsx')

print(f"Training dataset: {df_train.shape}")
print(f"Test dataset: {df_test.shape}")
print(f"Evaluation dataset: {df_eval.shape}")

# Class balance analysis
class_distribution = df_train['fraude'].value_counts()
print("\nClass balance:")
print(f"Not Fraud (0): {class_distribution[0]} ({class_distribution[0]/len(df_train)*100:.1f}%)")
print(f"Fraud (1): {class_distribution[1]} ({class_distribution[1]/len(df_train)*100:.1f}%)")

In [None]:
# Visualize the class balance
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Sort by class label so the order is always 0, 1 regardless of which class is larger
class_counts_sorted = class_distribution.sort_index()

# Bar chart
class_counts_sorted.plot(kind='bar', ax=ax1)
ax1.set_title('Class Distribution')
ax1.set_xlabel('Class (0=Not Fraud, 1=Fraud)')
ax1.set_ylabel('Number of Cases')
ax1.tick_params(axis='x', rotation=0)

# Pie chart (labels match the sorted order above)
ax2.pie(class_counts_sorted.values, labels=['Not Fraud', 'Fraud'], autopct='%1.1f%%')
ax2.set_title('Class Proportions')

plt.tight_layout()
plt.show()

print(f"Imbalance ratio: {class_distribution[0]/class_distribution[1]:.2f}:1")

In [None]:
# Missing value analysis
missing_data = df_train.isnull().sum()
missing_data = missing_data[missing_data > 0].sort_values(ascending=False)

if len(missing_data) > 0:
    plt.figure(figsize=(10, 6))
    missing_data.plot(kind='bar')
    plt.title('Missing Values per Variable')
    plt.xlabel('Variables')
    plt.ylabel('Number of Missing Values')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

    print(f"Variables with missing values: {len(missing_data)}")
    print(f"Total missing values: {missing_data.sum()}")
else:
    print("There are no missing values in the dataset")

## Phase 2 (continued): Advanced Preprocessing

In [None]:
# Identify variable types
categorical_features = ['descri_apli_prod_ben', 'marca_timeout', 'marca_host_no_resp']
numerical_features = [col for col in df_train.columns 
                     if col not in categorical_features + ['radicado', 'fraude']]

print(f"Categorical variables: {len(categorical_features)}")
print(f"Numerical variables: {len(numerical_features)}")

# Separate features and target
X = df_train.drop(['radicado', 'fraude'], axis=1)
y = df_train['fraude']

print(f"\nShape of X: {X.shape}")
print(f"Shape of y: {y.shape}")

In [None]:
# 1. Advanced imputation
print("🔧 Applying advanced imputation...")

# Categorical features: mode
for col in categorical_features:
    if col in X.columns:
        mode_value = X[col].mode().iloc[0] if not X[col].mode().empty else 'Unknown'
        X[col] = X[col].fillna(mode_value)
        print(f"✅ {col}: imputed with the mode")

# Numerical features: KNNImputer
numerical_cols_in_X = [col for col in numerical_features if col in X.columns]
if len(numerical_cols_in_X) > 0:
    knn_imputer = KNNImputer(n_neighbors=5)
    X[numerical_cols_in_X] = knn_imputer.fit_transform(X[numerical_cols_in_X])
    print("✅ Numerical variables: KNNImputer applied")

print(f"Remaining missing values: {X.isnull().sum().sum()}")

In [None]:
# 2. One-Hot Encoding
print("🔧 Applying One-Hot Encoding...")
X_encoded = pd.get_dummies(X, columns=categorical_features, prefix=categorical_features, drop_first=True)
print(f"Dimensions after encoding: {X_encoded.shape}")

# 3. Scaling
print("🔧 Applying scaling...")
scaler = StandardScaler()
numerical_cols_encoded = [col for col in X_encoded.columns if col in numerical_cols_in_X]
X_scaled = X_encoded.copy()
X_scaled[numerical_cols_encoded] = scaler.fit_transform(X_encoded[numerical_cols_encoded])
print(f"Numerical variables scaled: {len(numerical_cols_encoded)}")
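
The imputer and scaler above are fitted on the full training table before it is split, so statistics from the validation rows can influence the transforms. One possible alternative, sketched here on synthetic data with hypothetical column names (`amount`, `n_tx`, `channel`), wraps the same steps in a scikit-learn `Pipeline` so they are refitted inside each cross-validation fold.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the training table (all names and values are assumptions).
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "amount": rng.normal(100, 20, n),
    "n_tx": rng.integers(1, 10, n).astype(float),
    "channel": rng.choice(["web", "app", "branch"], n),
})
demo.loc[rng.choice(n, 20, replace=False), "amount"] = np.nan  # inject missing values
y_demo = (rng.random(n) < 0.365).astype(int)

numeric_cols = ["amount", "n_tx"]
categorical_cols = ["channel"]

# Numeric columns: KNN imputation then scaling; categorical columns: one-hot encoding.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", KNNImputer(n_neighbors=5)),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

clf = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

# Every fold refits the imputer, scaler, and encoder on its own training part only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, demo, y_demo, cv=cv, scoring="roc_auc")
print(scores.round(3))
```

The labels here are random, so the scores only demonstrate the mechanics. If oversampling is also needed inside the folds, imbalanced-learn provides its own `Pipeline` class that accepts samplers such as `SMOTE`.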

In [None]:
# 4. Train/validation split
X_train, X_val, y_train, y_val = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training: {X_train.shape[0]} samples")
print(f"Validation: {X_val.shape[0]} samples")

# 5. SMOTE to balance the classes (training split only)
print("\n🔧 Applying SMOTE...")
original_distribution = pd.Series(y_train).value_counts()
print(f"Original distribution: {dict(original_distribution)}")

smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

balanced_distribution = pd.Series(y_train_balanced).value_counts()
print(f"Distribution after SMOTE: {dict(balanced_distribution)}")
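
SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The following is a simplified illustration of that idea in plain NumPy on made-up 2-D points. It is not the `imblearn` implementation, and `smote_like_sample` is a hypothetical helper name.

```python
import numpy as np

def smote_like_sample(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between minority
    samples and their k nearest minority neighbours (simplified sketch)."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))       # pick a minority sample
        neigh = rng.choice(neighbours[base])  # pick one of its neighbours
        gap = rng.random()                    # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[neigh] - X_min[base])
    return synthetic

X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
X_new = smote_like_sample(X_minority, n_new=6)
print(X_new.shape)  # (6, 2)
```

Because every synthetic point is derived from real training rows, cross-validating on data that has already been resampled can place a synthetic point and its parent in different folds, which tends to make the cross-validated score optimistic.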

## Phase 3: Modeling and Optimization

In [None]:
# Define models
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=42, n_estimators=100),
    'XGBoost': xgb.XGBClassifier(random_state=42, eval_metric='logloss')
}

# Initial evaluation
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model_results = {}

print("🏃‍♂️ Evaluating base models...")
for name, model in models.items():
    print(f"\nEvaluating {name}...")
    
    # Cross-validation (run on the SMOTE-resampled training data, so scores may be optimistic)
    cv_scores = cross_val_score(model, X_train_balanced, y_train_balanced, 
                               cv=cv_strategy, scoring='roc_auc', n_jobs=-1)
    
    # Fit and predict
    model.fit(X_train_balanced, y_train_balanced)
    y_pred = model.predict(X_val)
    y_pred_proba = model.predict_proba(X_val)[:, 1]
    
    # Metrics
    accuracy = accuracy_score(y_val, y_pred)
    precision = precision_score(y_val, y_pred)
    recall = recall_score(y_val, y_pred)
    f1 = f1_score(y_val, y_pred)
    auc_score = roc_auc_score(y_val, y_pred_proba)
    
    model_results[name] = {
        'cv_auc_mean': cv_scores.mean(),
        'cv_auc_std': cv_scores.std(),
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'auc': auc_score
    }
    
    print(f"CV AUC: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
    print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}")

In [None]:
# Visual comparison of models
results_df = pd.DataFrame(model_results).T

fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# AUC
axes[0,0].bar(results_df.index, results_df['cv_auc_mean'])
axes[0,0].set_title('AUC Score (Cross-Validation)')
axes[0,0].set_ylabel('AUC')
axes[0,0].tick_params(axis='x', rotation=45)

# Precision
axes[0,1].bar(results_df.index, results_df['precision'])
axes[0,1].set_title('Precision')
axes[0,1].set_ylabel('Precision')
axes[0,1].tick_params(axis='x', rotation=45)

# Recall
axes[1,0].bar(results_df.index, results_df['recall'])
axes[1,0].set_title('Recall')
axes[1,0].set_ylabel('Recall')
axes[1,0].tick_params(axis='x', rotation=45)

# F1-Score
axes[1,1].bar(results_df.index, results_df['f1'])
axes[1,1].set_title('F1-Score')
axes[1,1].set_ylabel('F1-Score')
axes[1,1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

# Best model (selected by mean cross-validated AUC)
best_model_name = max(model_results.keys(), key=lambda x: model_results[x]['cv_auc_mean'])
print(f"\n🏆 Best model: {best_model_name}")
print(f"CV AUC: {model_results[best_model_name]['cv_auc_mean']:.4f}")

In [None]:
# Hyperparameter optimization
print(f"⚙️ Optimizing hyperparameters for {best_model_name}...")

if best_model_name == 'Random Forest':
    param_grid = {
        'n_estimators': [100, 200],
        'max_depth': [10, 20, None],
        'min_samples_split': [2, 5],
        'min_samples_leaf': [1, 2]
    }
    base_model = RandomForestClassifier(random_state=42)
elif best_model_name == 'Logistic Regression':
    param_grid = {
        'C': [0.1, 1, 10],
        'penalty': ['l1', 'l2'],
        'solver': ['liblinear']
    }
    base_model = LogisticRegression(random_state=42, max_iter=1000)
else:  # XGBoost
    param_grid = {
        'n_estimators': [100, 200],
        'max_depth': [3, 6],
        'learning_rate': [0.1, 0.2],
        'subsample': [0.8, 1.0]
    }
    base_model = xgb.XGBClassifier(random_state=42, eval_metric='logloss')

grid_search = GridSearchCV(
    base_model, param_grid, cv=cv_strategy, 
    scoring='roc_auc', n_jobs=-1, verbose=0
)

grid_search.fit(X_train_balanced, y_train_balanced)
best_model = grid_search.best_estimator_

print(f"✅ Best parameters: {grid_search.best_params_}")
print(f"✅ Best CV AUC: {grid_search.best_score_:.4f}")

## Phase 4: Rigorous Evaluation

In [None]:
# Final predictions on the validation set
y_pred_final = best_model.predict(X_val)
y_pred_proba_final = best_model.predict_proba(X_val)[:, 1]

# Final metrics
final_accuracy = accuracy_score(y_val, y_pred_final)
final_precision = precision_score(y_val, y_pred_final)
final_recall = recall_score(y_val, y_pred_final)
final_f1 = f1_score(y_val, y_pred_final)
final_auc = roc_auc_score(y_val, y_pred_proba_final)

print("📈 FINAL METRICS:")
print(f"Accuracy: {final_accuracy:.4f} ({final_accuracy*100:.1f}%)")
print(f"Precision: {final_precision:.4f} ({final_precision*100:.1f}%)")
print(f"Recall: {final_recall:.4f} ({final_recall*100:.1f}%)")
print(f"F1-Score: {final_f1:.4f}")
print(f"ROC-AUC: {final_auc:.4f}")

In [None]:
# Confusion matrix
cm = confusion_matrix(y_val, y_pred_final)
tn, fp, fn, tp = cm.ravel()

plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Not Fraud', 'Fraud'],
            yticklabels=['Not Fraud', 'Fraud'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

print("\n💼 BUSINESS INTERPRETATION:")
print(f"Fraud cases correctly detected: {tp} of {tp+fn} ({tp/(tp+fn)*100:.1f}%)")
print(f"False positives (false alarms): {fp}")
print(f"Undetected fraud cases (losses): {fn}")
print(f"Normal cases correctly classified: {tn}")

fraud_detection_rate = tp / (tp + fn) * 100
false_positive_rate = fp / (fp + tn) * 100
print(f"\nFraud detection rate: {fraud_detection_rate:.1f}%")
print(f"False alarm rate: {false_positive_rate:.1f}%")
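
The two rates printed above follow directly from the confusion-matrix cells. A small worked example with made-up counts (the numbers are illustrative, not results from this notebook, and the function name is hypothetical):

```python
def detection_and_false_alarm_rates(tn, fp, fn, tp):
    """Return (recall, false positive rate) as fractions in [0, 1]."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # share of actual fraud that is caught
    fpr = fp / (fp + tn) if (fp + tn) else 0.0     # share of normal cases flagged as fraud
    return recall, fpr

recall, fpr = detection_and_false_alarm_rates(tn=900, fp=100, fn=50, tp=150)
print(f"Detection rate: {recall:.1%}")    # 75.0%
print(f"False alarm rate: {fpr:.1%}")     # 10.0%
```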

In [None]:
# ROC curve
fpr, tpr, thresholds = roc_curve(y_val, y_pred_proba_final)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC Curve (AUC = {final_auc:.4f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()
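
ROC-AUC can be read as the probability that a randomly chosen fraud case receives a higher score than a randomly chosen non-fraud case. A minimal sketch with made-up scores that computes it by comparing every positive/negative pair, counting ties as one half:

```python
import numpy as np

# Made-up labels and scores, for illustration only.
y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.10, 0.40, 0.35, 0.20, 0.80, 0.70])

pos = scores[y_true == 1]
neg = scores[y_true == 0]
# Difference for every (positive, negative) pair.
diff = pos[:, None] - neg[None, :]
auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size
print(f"AUC: {auc:.3f}")  # 0.889 (8 of 9 pairs ranked correctly)
```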

In [None]:
# Feature importance (if available)
if hasattr(best_model, 'feature_importances_'):
    feature_names = X_train_balanced.columns
    importances = best_model.feature_importances_
    feature_importance_df = pd.DataFrame({
        'feature': feature_names,
        'importance': importances
    }).sort_values('importance', ascending=False)
    
    # Top 15 features
    plt.figure(figsize=(10, 8))
    top_features = feature_importance_df.head(15)
    sns.barplot(data=top_features, y='feature', x='importance')
    plt.title('Top 15 Most Important Features')
    plt.xlabel('Importance')
    plt.tight_layout()
    plt.show()
    
    print("🎯 TOP 10 MOST IMPORTANT FEATURES:")
    for i, (_, row) in enumerate(feature_importance_df.head(10).iterrows()):
        print(f"{i+1:2d}. {row['feature']}: {row['importance']:.4f}")
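
When the selected model does not expose `feature_importances_` (logistic regression, for example), permutation importance offers a model-agnostic alternative. A minimal sketch on synthetic data using scikit-learn's `permutation_importance`; the data and feature names are hypothetical:

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
# Only the first column drives the label; the other two are noise.
y_demo = (X_demo[:, 0] + 0.1 * rng.normal(size=300) > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Shuffle each column in turn and measure how much the AUC drops.
result = permutation_importance(
    model, X_demo, y_demo, scoring="roc_auc", n_repeats=10, random_state=0
)
for name, mean in zip(["signal", "noise_a", "noise_b"], result.importances_mean):
    print(f"{name}: {mean:.4f}")
```

For brevity the sketch scores the model on the data it was fitted on; in this notebook the validation split (`X_val`, `y_val`) would be the more natural input.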

## Applying the Model to the Test Dataset

In [None]:
# Preprocess the test dataset with the same methodology
print("🔧 Preprocessing test dataset...")

# Separate features from the test dataset
X_test_raw = df_test.drop(['radicado'], axis=1)

# 1. Imputation (same method; the mode comes from the training data)
for col in categorical_features:
    if col in X_test_raw.columns:
        mode_value = X[col].mode().iloc[0] if not X[col].mode().empty else 'Unknown'
        X_test_raw[col] = X_test_raw[col].fillna(mode_value)

# Numerical features: reuse the imputer fitted on the training data
numerical_cols_test = [col for col in numerical_features if col in X_test_raw.columns]
if len(numerical_cols_test) > 0:
    X_test_raw[numerical_cols_test] = knn_imputer.transform(X_test_raw[numerical_cols_test])

# 2. One-Hot Encoding (ensure the same columns as in training)
# Encode every test category without drop_first: dropping the first category seen in the
# test set could remove a different baseline than the one dropped in training.
X_test_encoded = pd.get_dummies(X_test_raw, columns=categorical_features, prefix=categorical_features)

# Add dummy columns missing from the test set, drop unseen ones, and match the training order
X_test_encoded = X_test_encoded.reindex(columns=X_encoded.columns, fill_value=0)

# 3. Scaling (reuse the scaler fitted on the training data)
X_test_scaled = X_test_encoded.copy()
numerical_cols_test_encoded = [col for col in X_test_encoded.columns if col in numerical_cols_encoded]
X_test_scaled[numerical_cols_test_encoded] = scaler.transform(X_test_encoded[numerical_cols_test_encoded])

print(f"✅ Test dataset preprocessed: {X_test_scaled.shape}")
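
When a separate test set is one-hot encoded on its own, `drop_first=True` drops the first category present in that frame, which need not be the baseline dropped in training. A minimal sketch of one way to keep the encodings consistent, using made-up frames with a hypothetical `channel` column: encode all test categories, then align to the training columns with `DataFrame.reindex`.

```python
import pandas as pd

train = pd.DataFrame({"amount": [10.0, 20.0, 30.0], "channel": ["app", "web", "branch"]})
test = pd.DataFrame({"amount": [15.0, 25.0], "channel": ["web", "phone"]})  # "phone" is unseen

# Training: drop the first category ("app") as the baseline.
train_enc = pd.get_dummies(train, columns=["channel"], drop_first=True)
# Test: keep every category, then align to the training layout.
test_enc = pd.get_dummies(test, columns=["channel"])
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_aligned.columns))  # ['amount', 'channel_branch', 'channel_web']
```

`reindex` adds `channel_branch` filled with zeros, discards `channel_phone`, and fixes the column order in a single call. The unseen `phone` row ends up with all dummies at zero, which is the same encoding as the training baseline.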

In [None]:
# Make predictions on the test dataset
print("🔮 Making predictions on the test dataset...")

test_predictions = best_model.predict(X_test_scaled)
test_probabilities = best_model.predict_proba(X_test_scaled)[:, 1]

# Build a DataFrame with the results
results_df = pd.DataFrame({
    'radicado': df_test['radicado'],
    'fraude_prediccion': test_predictions,
    'probabilidad_fraude': test_probabilities
})

# Prediction statistics
pred_distribution = pd.Series(test_predictions).value_counts()
print("\n📊 RESULTS ON THE TEST DATASET:")
print(f"Total cases: {len(test_predictions)}")
print(f"Predicted Not Fraud (0): {pred_distribution.get(0, 0)} ({pred_distribution.get(0, 0)/len(test_predictions)*100:.1f}%)")
print(f"Predicted Fraud (1): {pred_distribution.get(1, 0)} ({pred_distribution.get(1, 0)/len(test_predictions)*100:.1f}%)")

# Show the cases with the highest fraud probability
print("\n🚨 TOP 10 CASES WITH THE HIGHEST FRAUD PROBABILITY:")
top_fraud_cases = results_df.nlargest(10, 'probabilidad_fraude')
for idx, row in top_fraud_cases.iterrows():
    print(f"Radicado: {row['radicado']}, Probability: {row['probabilidad_fraude']:.4f}")

# Save results
results_df.to_excel('predicciones_fraude_mejoradas.xlsx', index=False)
print("\n💾 Results saved to 'predicciones_fraude_mejoradas.xlsx'")

In [None]:
# Visualize the distribution of predicted probabilities
plt.figure(figsize=(12, 5))

# Histogram of probabilities
plt.subplot(1, 2, 1)
plt.hist(test_probabilities, bins=50, alpha=0.7, color='skyblue', edgecolor='black')
plt.title('Distribution of Fraud Probabilities')
plt.xlabel('Fraud Probability')
plt.ylabel('Frequency')
plt.axvline(x=0.5, color='red', linestyle='--', label='Threshold = 0.5')
plt.legend()

# Boxplot by prediction
plt.subplot(1, 2, 2)
results_df.boxplot(column='probabilidad_fraude', by='fraude_prediccion', ax=plt.gca())
plt.suptitle('')  # remove the automatic "Boxplot grouped by ..." figure title
plt.title('Probabilities by Prediction')
plt.xlabel('Prediction (0=Not Fraud, 1=Fraud)')
plt.ylabel('Fraud Probability')

plt.tight_layout()
plt.show()

## Conclusions and Recommendations

### Improvements Implemented vs. the Original Model:
1. ✅ **KNN imputation** vs. filling with 0
2. ✅ **One-hot encoding** vs. arbitrary integer assignment
3. ✅ **Balancing with SMOTE** vs. imbalanced data
4. ✅ **Multiple metrics** vs. accuracy only
5. ✅ **Hyperparameter optimization**
6. ✅ **Feature scaling**

### Next Steps:
- Additional feature engineering
- Ensemble methods
- Threshold tuning based on business costs
- Automated production pipeline
- Continuous model monitoring
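
As a starting point for the threshold item, here is a minimal sketch that picks the cutoff minimising total cost on a set of validation scores. The labels, scores, and the two unit costs (`cost_fn`, `cost_fp`) are made-up assumptions; real values would have to come from the business.

```python
import numpy as np

# Made-up validation labels and predicted fraud probabilities.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
proba = np.array([0.03, 0.12, 0.22, 0.31, 0.37, 0.46, 0.57, 0.62, 0.66, 0.91])
cost_fn, cost_fp = 10.0, 1.0  # hypothetical: a missed fraud costs 10x a false alarm

def total_cost(threshold):
    """Total cost of false negatives and false positives at a given cutoff."""
    pred = (proba >= threshold).astype(int)
    fn = int(((pred == 0) & (y_true == 1)).sum())
    fp = int(((pred == 1) & (y_true == 0)).sum())
    return cost_fn * fn + cost_fp * fp

thresholds = np.linspace(0.05, 0.95, 19)  # grid with a step of 0.05
costs = np.array([total_cost(t) for t in thresholds])
best_threshold = thresholds[costs.argmin()]

print(f"Cost at 0.50: {total_cost(0.5):.1f}")  # 11.0
print(f"Best threshold: {best_threshold:.2f}, cost: {costs.min():.1f}")  # 0.35, 2.0
```

With these assumed costs the cheapest cutoff sits well below the default 0.5, because missing a fraud case is priced much higher than raising a false alarm.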
