# FEATURE ENGINEERING AND RED FLAGS

# Features Based on AML Red Flags

## Round Amount Detection

Flags "round" amounts (such as exact multiples of 10,000, per the `round_10000` flag), which are typical of money laundering.



In [1]:
from src.utils import get_sample_data

df_sample = get_sample_data()



In [None]:
from src.feature_engineerings import create_round_amount_features

# Apply
df = create_round_amount_features(df_sample)

# Validate effectiveness
round_fraud_rate = df[df['round_10000'] == 1]['isFraud'].mean()
normal_fraud_rate = df[df['round_10000'] == 0]['isFraud'].mean()
print(f"Round amounts: {round_fraud_rate:.4f} vs Normal: {normal_fraud_rate:.4f}")
print(f"Risk factor: {round_fraud_rate/normal_fraud_rate:.1f}x more likely")
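
The internals of `create_round_amount_features` are not shown in this notebook. As a minimal sketch, assuming the `round_10000` flag marks amounts that are exact multiples of 10,000, the check could look like this in plain Python (the helper name `is_round_amount` is hypothetical):

```python
def is_round_amount(amount: float, base: int = 10_000) -> int:
    """Return 1 if `amount` is a positive exact multiple of `base`, else 0.

    Assumption: amounts are currency values with at most two decimals,
    so the comparison is done in integer cents to avoid float noise.
    """
    cents = round(amount * 100)
    return int(cents > 0 and cents % (base * 100) == 0)


amounts = [10_000.00, 50_000.00, 9_999.99, 12_345.67, 0.0]
flags = [is_round_amount(a) for a in amounts]
print(flags)  # [1, 1, 0, 0, 0]
```

The same test vectorizes naturally over a DataFrame column; the per-value version is used here only to keep the logic easy to read.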

## Balance Inconsistency Detection

Detects cases where the balances do not add up arithmetically (a sign of manipulation).



In [3]:
from src.feature_engineerings import create_balance_features

# Apply
df = create_balance_features(df_sample)

# Effectiveness analysis
inconsistent_fraud = df[df['total_inconsistency'] > 0]['isFraud'].mean()
consistent_fraud = df[df['total_inconsistency'] == 0]['isFraud'].mean()
print(f"Inconsistent: {inconsistent_fraud:.4f} vs Consistent: {consistent_fraud:.4f}")

Inconsistent: 0.0007 vs Consistent: 0.0112
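
To illustrate the idea, the sketch below checks whether a transaction's balances reconcile. It assumes PaySim-style fields (`oldbalanceOrg`, `newbalanceOrig`, `oldbalanceDest`, `newbalanceDest`, `amount`) and a simple definition of the error terms. The actual formulas inside `create_balance_features` may differ, and the function name `balance_errors` is hypothetical.

```python
def balance_errors(tx: dict, tol: float = 0.01) -> dict:
    """Compute reconciliation errors for one transaction (assumed definitions)."""
    # Origin account: old balance minus amount should equal the new balance.
    error_orig = tx["oldbalanceOrg"] - tx["amount"] - tx["newbalanceOrig"]
    # Destination account: old balance plus amount should equal the new balance.
    error_dest = tx["oldbalanceDest"] + tx["amount"] - tx["newbalanceDest"]
    inconsistent_orig = int(abs(error_orig) > tol)
    inconsistent_dest = int(abs(error_dest) > tol)
    return {
        "balance_error_orig": error_orig,
        "balance_error_dest": error_dest,
        "balance_inconsistent_orig": inconsistent_orig,
        "total_inconsistency": inconsistent_orig + inconsistent_dest,
    }


consistent_tx = {"oldbalanceOrg": 1000.0, "newbalanceOrig": 800.0,
                 "oldbalanceDest": 50.0, "newbalanceDest": 250.0, "amount": 200.0}
inconsistent_tx = {"oldbalanceOrg": 1000.0, "newbalanceOrig": 1000.0,
                   "oldbalanceDest": 50.0, "newbalanceDest": 50.0, "amount": 200.0}

print(balance_errors(consistent_tx)["total_inconsistency"])    # 0
print(balance_errors(inconsistent_tx)["total_inconsistency"])  # 2
```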


## Velocity Banking Detection

Identifies layering (funds that move very quickly).
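
No velocity-specific function appears in this notebook, so the following is only a sketch of how a layering signal could be computed. It assumes each transaction carries a sender ID (`nameOrig`, as in PaySim) and an integer time `step`. The helper `tx_count_in_window` and the window size are hypothetical.

```python
from collections import defaultdict, deque


def tx_count_in_window(transactions, window=3):
    """For each transaction, count earlier transactions by the same sender
    that happened within the last `window` steps.

    `transactions` must already be sorted by `step`.
    """
    recent = defaultdict(deque)  # sender -> steps of recent transactions
    counts = []
    for tx in transactions:
        steps = recent[tx["nameOrig"]]
        # Drop steps that have fallen out of the window.
        while steps and tx["step"] - steps[0] > window:
            steps.popleft()
        counts.append(len(steps))
        steps.append(tx["step"])
    return counts


txs = [
    {"nameOrig": "C1", "step": 1},
    {"nameOrig": "C1", "step": 2},
    {"nameOrig": "C2", "step": 2},
    {"nameOrig": "C1", "step": 3},
    {"nameOrig": "C1", "step": 10},
]
print(tx_count_in_window(txs))  # [0, 1, 0, 2, 0]
```

A high count for a single sender in a short window is the kind of rapid movement the layering red flag describes.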





## Suspicious Pattern Detection

Detects anomalous behavior patterns typical of fraud.



In [5]:
from src.feature_engineerings import create_pattern_features

# Apply
df = create_pattern_features(df_sample)

# Pattern analysis
pattern_analysis = df.groupby(['off_hours', 'suspicious_frequency'])['isFraud'].mean()
print("Suspicious pattern analysis:")
print(pattern_analysis)

Suspicious pattern analysis:
off_hours  suspicious_frequency
0          0                       0.000760
1          0                       0.008689
Name: isFraud, dtype: float64
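
A sketch of how an `off_hours` flag might be derived, assuming `step` counts hours from the start of the data (so `step % 24` gives the hour of day) and that "off hours" means from 22:00 up to, but not including, 06:00. Both the thresholds and the helper names are assumptions, not the documented behavior of `create_pattern_features`.

```python
def hour_of_day(step: int) -> int:
    """Map an hourly step counter to an hour in 0..23 (assumed convention)."""
    return step % 24


def is_off_hours(hour: int, start: int = 22, end: int = 6) -> int:
    """Return 1 for hours in the overnight window [start, 24) or [0, end)."""
    return int(hour >= start or hour < end)


steps = [1, 18, 23, 30]
hours = [hour_of_day(s) for s in steps]
flags = [is_off_hours(h) for h in hours]
print(hours)  # [1, 18, 23, 6]
print(flags)  # [1, 0, 1, 0]
```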


# 📊 Statistical Features


## Statistical Anomaly Detection

- Detects transactions that are statistically anomalous.
- Transactions with z-scores above 3 are 10x more likely to be investigated.



In [6]:
from src.feature_engineerings import create_statistical_features


# Apply
df = create_statistical_features(df_sample)

# Anomaly validation
extreme_fraud = df[df['extreme_z_score'] == 1]['isFraud'].mean()
normal_fraud = df[df['extreme_z_score'] == 0]['isFraud'].mean()
print(f"Extreme z-scores: {extreme_fraud:.4f} vs Normal: {normal_fraud:.4f}")

Extreme z-scores: 0.0314 vs Normal: 0.0008
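
For reference, a z-score is z = (x - mean) / std, and a value is treated as extreme when |z| > 3. The sketch below applies that rule to a small list of amounts. It assumes the population standard deviation and raw (not log-transformed) amounts, which may not match what `create_statistical_features` does.

```python
import statistics


def z_scores(values):
    """Return the z-score of each value relative to the whole list."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # All values identical: nothing is anomalous.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Nineteen ordinary amounts and one very large one.
amounts = [100.0] * 19 + [10_000.0]
z = z_scores(amounts)
extreme = [int(abs(score) > 3) for score in z]
print(sum(extreme))  # 1
```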


## Rolling Window Features

Compares current behavior against the customer's history.

In [7]:
from src.feature_engineerings import create_rolling_features

# Apply
df = create_rolling_features(df_sample)

# Effectiveness analysis
unusual_fraud = df[df['unusual_vs_history_10'] == 1]['isFraud'].mean()
normal_fraud = df[df['unusual_vs_history_10'] == 0]['isFraud'].mean()
print(f"Unusual vs history: {unusual_fraud:.4f} vs Normal: {normal_fraud:.4f}")

Unusual vs history: nan vs Normal: 0.0011
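
A `nan` rate like the one above results when the mean is taken over an empty selection, that is, when no row in the sample has `unusual_vs_history_10 == 1`. The sketch below shows one way such a flag could be built, assuming it compares each amount with the mean of the same customer's previous 10 transactions. The threshold `factor`, the `nameOrig` field, and the name `unusual_vs_history` are hypothetical.

```python
from collections import defaultdict, deque


def unusual_vs_history(transactions, window=10, factor=3.0):
    """Flag amounts larger than `factor` times the customer's rolling mean.

    `transactions` must be in chronological order. Customers with no
    previous transactions are never flagged.
    """
    history = defaultdict(lambda: deque(maxlen=window))
    flags = []
    for tx in transactions:
        past = history[tx["nameOrig"]]
        if past:
            baseline = sum(past) / len(past)
            flags.append(int(tx["amount"] > factor * baseline))
        else:
            flags.append(0)  # no history to compare against
        past.append(tx["amount"])
    return flags


txs = [
    {"nameOrig": "C1", "amount": 100.0},
    {"nameOrig": "C1", "amount": 120.0},
    {"nameOrig": "C1", "amount": 900.0},
    {"nameOrig": "C2", "amount": 900.0},
]
print(unusual_vs_history(txs))  # [0, 0, 1, 0]
```

Under this definition, customers with a single transaction can never be flagged, which is one way an all-zero column (and therefore a `nan` rate) could arise.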


## Categorical Encoding

Converts categorical variables to a numeric format.

In [8]:
from src.feature_engineerings import encode_categorical_features

# Apply
df, label_encoder = encode_categorical_features(df_sample)
print("Categorical variables encoded successfully")
# print(f"Total variables: {df.shape[1]}")
print(df[[col for col in df.columns if 'hour' in col]].head())



Categorical features encoded:
Total features: 55
Categorical variables encoded successfully
   hour_of_day  off_hours  hour_sin      hour_cos
0           18          0 -1.000000 -1.836970e-16
1           15          0 -0.707107 -7.071068e-01
2           13          0 -0.258819 -9.659258e-01
3           15          0 -0.707107 -7.071068e-01
4           19          0 -0.965926  2.588190e-01
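
The `hour_sin` and `hour_cos` values printed above are consistent with the standard cyclical encoding of a 24-hour clock, sin(2π·h/24) and cos(2π·h/24). A short sketch that reproduces them (the function name `encode_hour` is hypothetical):

```python
import math


def encode_hour(hour: int, period: int = 24):
    """Map an hour onto the unit circle so that 23:00 and 00:00 end up close."""
    angle = 2 * math.pi * hour / period
    return math.sin(angle), math.cos(angle)


for h in (18, 15, 13, 19):
    hour_sin, hour_cos = encode_hour(h)
    print(h, hour_sin, hour_cos)
```

Encoding the hour this way avoids the artificial jump between 23 and 0 that a plain integer column would introduce.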


## Feature Scaling & Normalization

Normalizes feature scales for ML algorithms.

In [9]:
from src.feature_engineerings import scale_features
# Apply
X_train, X_test, y_train, y_test, scaler = scale_features(df_sample)

Scaled features: 42 variables
Training set: 80000 samples
Test set: 20000 samples
Train fraud rate: 0.0011
Test fraud rate: 0.0010
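
The returned `scaler` suggests a fit-on-train, apply-to-test workflow. The sketch below shows that idea with plain standardization, (x - mean) / std. It is an assumption that `scale_features` does something similar, and the `fit_standardizer` and `transform` helpers are hypothetical.

```python
import statistics


def fit_standardizer(train_column):
    """Learn mean and standard deviation from the training data only."""
    mean = statistics.fmean(train_column)
    std = statistics.pstdev(train_column) or 1.0  # avoid dividing by zero
    return mean, std


def transform(column, params):
    """Apply previously learned parameters to any column."""
    mean, std = params
    return [(x - mean) / std for x in column]


train = [10.0, 20.0, 30.0]
test = [20.0, 40.0]

params = fit_standardizer(train)
train_scaled = transform(train, params)
test_scaled = transform(test, params)  # reuses the training statistics
print(train_scaled)
print(test_scaled)
```

Fitting on the training split alone keeps information from the test set out of the preprocessing step.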


## Feature Selection & Importance

Selects the most important features for detection.

In [10]:

from src.feature_engineerings import select_important_features
# Apply
feature_importance, selected_features, selector = select_important_features(X_train, y_train)

# Build the final optimized dataset
X_train_final = X_train[selected_features]
X_test_final  = X_test[selected_features]

print(f"\n🎯 Final dataset: {X_train_final.shape[1]} features")
print("Ready for ML model training!")


🔥 TOP 15 MOST IMPORTANT FEATURES:
 1. balance_inconsistent_orig 0.2017
 2. balance_error_orig        0.1639
 3. account_drained           0.1376
 4. oldbalanceOrg             0.0597
 5. total_inconsistency       0.0594
 6. newbalanceOrig            0.0585
 7. log_z_score               0.0351
 8. amount_percentile         0.0272
 9. type_encoded              0.0256
10. z_score_amount            0.0242
11. amount                    0.0221
12. z_score_PAYMENT           0.0215
13. log_amount                0.0208
14. step                      0.0182
15. hour_of_day               0.0143
16. top_5_percent             0.0142
17. z_score_CASH_OUT          0.0131
18. newbalanceDest            0.0129
19. z_score_TRANSFER          0.0108
20. balance_error_dest        0.0098

✅ 20 features selected automatically

🎯 Final dataset: 20 features
Ready for ML model training!
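
Picking the top-k features from an importance ranking can be sketched as follows, using a few of the scores printed above. How `select_important_features` computes those scores is not shown in this notebook, and `top_k_features` is a hypothetical helper.

```python
def top_k_features(importance: dict, k: int) -> list:
    """Return the names of the k features with the highest importance."""
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# A subset of the scores from the output above, deliberately unordered.
importance = {
    "oldbalanceOrg": 0.0597,
    "balance_inconsistent_orig": 0.2017,
    "account_drained": 0.1376,
    "balance_error_orig": 0.1639,
    "total_inconsistency": 0.0594,
}

selected = top_k_features(importance, k=3)
print(selected)
# ['balance_inconsistent_orig', 'balance_error_orig', 'account_drained']
```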


