# Fraud Detection with Semi-Supervised Learning

## Formal Pseudocode for the Self-Training Algorithm

**ALGORITHM** `SelfTraining(D_labeled, D_unlabeled, threshold)`

**INPUT:**
- `D_labeled ← {(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)}` (labeled dataset)
- `D_unlabeled ← {x₁', x₂', …, xₘ'}` (unlabeled dataset)
- `threshold` (confidence threshold for pseudo-labeling)

**OUTPUT:**
- Trained model `M_final`
- Expanded set `D_expanded`, including pseudo-labels

**STEPS:**

1. **Initialization:**
   - `M ← train_base_model(D_labeled)`
   - `D_temp ← D_labeled`
   - `D_remaining ← D_unlabeled`

2. **WHILE** `D_remaining ≠ ∅` **AND** the stopping condition (for example, a maximum number of iterations) is not met:

   a. **Predict with confidence:**
      - For each `xᵢ ∈ D_remaining`:
        - `prob_i ← M.predict_proba(xᵢ)`
        - `confidence_i ← MAX(prob_i)`

   b. **Select high-confidence predictions:**
      - `D_confident ← {(xᵢ, ŷᵢ) | confidence_i ≥ threshold}`
      - Where `ŷᵢ ← argmax(prob_i)`

   c. **IF** `D_confident = ∅` **THEN**:
      - **BREAK** (no confident predictions remain)

   d. **Expand the labeled set:**
      - `D_temp ← D_temp ∪ D_confident`
      - `D_remaining ← D_remaining \ {xᵢ | (xᵢ, ŷᵢ) ∈ D_confident}`

   e. **Retrain the model:**
      - `M ← train_model(D_temp)`

3. **RETURN** the final model `M` (as `M_final`) and the expanded set `D_temp` (as `D_expanded`)
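
The steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation: `CentroidModel` is a hypothetical one-dimensional toy classifier invented for the example, and `max_iter` is an assumed stand-in for the unspecified stopping condition.

```python
import math


class CentroidModel:
    """Hypothetical toy classifier for 1-D inputs (illustration only)."""

    def fit(self, X, y):
        # Store the mean of the points belonging to each class.
        self.centroids = {}
        for label in sorted(set(y)):
            points = [x for x, lab in zip(X, y) if lab == label]
            self.centroids[label] = sum(points) / len(points)
        return self

    def predict_proba(self, x):
        # Softmax over negative distances to each class centroid.
        scores = {c: math.exp(-abs(x - m)) for c, m in self.centroids.items()}
        total = sum(scores.values())
        return {c: s / total for c, s in scores.items()}


def train(make_model, pairs):
    """Fit a fresh model on a list of (x, y) pairs."""
    return make_model().fit([x for x, _ in pairs], [y for _, y in pairs])


def self_training(labeled, unlabeled, threshold, make_model, max_iter=10):
    # Step 1: initialization.
    d_temp = list(labeled)
    d_remaining = list(unlabeled)
    model = train(make_model, d_temp)

    # Step 2: loop until nothing is left or the assumed iteration cap is hit.
    for _ in range(max_iter):
        if not d_remaining:
            break
        confident, still_unlabeled = [], []
        for x in d_remaining:
            proba = model.predict_proba(x)        # step 2a
            y_hat = max(proba, key=proba.get)     # argmax over classes
            if proba[y_hat] >= threshold:         # step 2b
                confident.append((x, y_hat))
            else:
                still_unlabeled.append(x)
        if not confident:                         # step 2c
            break
        d_temp.extend(confident)                  # step 2d
        d_remaining = still_unlabeled
        model = train(make_model, d_temp)         # step 2e

    # Step 3: return the final model and the expanded set.
    return model, d_temp


labeled = [(0.0, 0), (1.0, 0), (9.0, 1), (10.0, 1)]
unlabeled = [0.5, 2.0, 5.2, 8.0, 9.5]
model, expanded = self_training(labeled, unlabeled, 0.9, CentroidModel)
print(len(expanded))  # 8: four pseudo-labels added, 5.2 never passes the threshold
```

The point at 5.2 sits almost midway between the two classes, so its confidence stays near 0.6 and the loop stops through step 2c instead of forcing a label onto it.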



## 1. Importing Libraries and Loading Data

In [1]:
# Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.metrics import (
    classification_report, confusion_matrix, roc_auc_score,
    f1_score, precision_score, recall_score, accuracy_score,
)
import warnings

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-darkgrid')

# Load data and report its shape and the share of positive TARGET rows
df_train = pd.read_csv("application_train.csv")
print(f"Data: {df_train.shape} | Fraud: {df_train['TARGET'].mean()*100:.1f}%")
df_train.head()

Data: (307511, 122) | Fraud: 8.1%


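|  | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |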
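
The `SelfTrainingClassifier` imported in the cell above follows the same pattern as the pseudocode: rows without a label are marked with `-1`, and only predictions at or above `threshold` become pseudo-labels. The sketch below runs on synthetic data rather than `application_train.csv`; the generated dataset, the 80% masking rate, the threshold of 0.9 and the logistic-regression base model are all assumptions chosen for illustration, not the notebook's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic, imbalanced binary problem (assumed stand-in for the real data).
X, y = make_classification(
    n_samples=500, n_features=5, weights=[0.9, 0.1], random_state=0
)

# Hide roughly 80% of the labels; scikit-learn expects -1 for "unlabeled".
rng = np.random.default_rng(0)
unlabeled = rng.random(len(y)) < 0.8
y_semi = y.copy()
y_semi[unlabeled] = -1

# The base model must expose predict_proba; threshold plays the same role
# as in the pseudocode, and max_iter caps the number of retraining rounds.
clf = SelfTrainingClassifier(
    LogisticRegression(max_iter=1000), threshold=0.9, max_iter=10
)
clf.fit(X, y_semi)

# labeled_iter_ is 0 for originally labeled rows, -1 for rows never labeled,
# and the iteration number for rows that received a pseudo-label.
n_pseudo = int((clf.labeled_iter_ > 0).sum())
print(f"Stopped because: {clf.termination_condition_}")
print(f"Pseudo-labeled rows: {n_pseudo} of {int(unlabeled.sum())} unlabeled")
```

With an imbalanced target like the 8.1% positive rate reported above, a high threshold tends to pseudo-label the majority class first, so it is worth checking the class mix of the pseudo-labels before trusting the expanded set.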
