# Part 2: Semi-Supervised Learning for Rating Classification

This notebook implements a semi-supervised learning protocol on the processed dataset, with a supervised baseline, Self-Training, and Label Spreading/Propagation (k-NN kernel). It includes EDA, the experimental setup, evaluation with the required metrics, and a hyperparameter analysis.

In [None]:
# Install the required dependencies (run once)
import sys, subprocess
packages = [
    'numpy',
    'pandas',
    'matplotlib',
    'seaborn',
    'scikit-learn',
    'scipy'
]
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q'] + packages)
print('Dependencies installed')

In [None]:
# General configuration
import warnings
warnings.filterwarnings('ignore')

import os
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, label_binarize
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.metrics import (f1_score, balanced_accuracy_score, confusion_matrix,
                             roc_curve, auc)
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.manifold import TSNE
from sklearn.semi_supervised import SelfTrainingClassifier, LabelSpreading, LabelPropagation

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

DATA_PATH = os.path.join('..', 'data', 'processed', 'dataset.csv')
assert os.path.exists(DATA_PATH), f'Dataset not found at {DATA_PATH}'
df = pd.read_csv(DATA_PATH)
df.head()

## Data structure and feature selection
- The identifier columns `cooperativa` and `abreviacion` are not used as features.
- The label column is `Label`.
- All remaining fields are assumed to be numeric and already preprocessed.

In [None]:
# Column definitions and X / y split
id_cols = ['cooperativa', 'abreviacion']
label_col = 'Label'
assert label_col in df.columns, 'Expected label column not found'
feature_cols = [c for c in df.columns if c not in id_cols + [label_col]]
X = df[feature_cols].copy()
y = df[label_col].astype(str).copy()
classes_ = np.unique(y)
n_classes = len(classes_)
X.shape, y.value_counts().to_dict()

## EDA: Distributions, correlations, and t-SNE
We explore the distribution of the indicators, their correlations and redundancies, and visualize the data with t-SNE.

In [None]:
# Indicator distributions (histograms)
_ = X.hist(figsize=(16, 12), bins=20)
plt.tight_layout()
plt.show()

# Correlation matrix
plt.figure(figsize=(12,10))
corr = X.corr(numeric_only=True)
sns.heatmap(corr, cmap='coolwarm', center=0)
plt.title('Correlation matrix')
plt.show()

# Redundancy: pairs with |corr| > 0.9
high_corr = []
thr = 0.9
num_cols = list(corr.columns)  # only numeric columns appear in corr
for i, c1 in enumerate(num_cols):
    for c2 in num_cols[i+1:]:
        val = corr.loc[c1, c2]
        if abs(val) > thr:
            high_corr.append((c1, c2, float(val)))
pd.DataFrame(high_corr, columns=['feat1','feat2','corr']).sort_values('corr', key=lambda s: s.abs(), ascending=False)


In [None]:
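# t-SNE for visualization
# t-SNE is distance-based and rejects NaN, so impute and scale before embedding
X_tsne = Pipeline([('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())]).fit_transform(X)
tsne = TSNE(n_components=2, random_state=RANDOM_STATE, perplexity=int(np.clip(len(X)//5, 5, 30)))
emb = tsne.fit_transform(X_tsne)
emb_df = pd.DataFrame(emb, columns=['tsne1','tsne2'])
emb_df[label_col] = y.values
plt.figure(figsize=(8,6))
sns.scatterplot(data=emb_df, x='tsne1', y='tsne2', hue=label_col, palette='tab10')
plt.title('t-SNE of indicators (color=Label)')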
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()


## 2.1 Setup and protocol
- `p ∈ {5%, 10%, 20%, 40%, 60%, 80%}` is the labeled fraction of Train.
- 10 repetitions per `p`.
- Preprocessing: imputation, scaling, and variance filter.
- Fixed seed for reproducibility.

In [None]:
# Preprocessing pipeline and utilities
preprocess = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('varth', VarianceThreshold(threshold=0.0))
])

def make_rf(random_state=RANDOM_STATE):
    return RandomForestClassifier(n_estimators=400, max_depth=None, n_jobs=-1, random_state=random_state)

def compute_metrics(y_true, y_pred, labels):
    f1_macro = f1_score(y_true, y_pred, average='macro', labels=labels)
    bal_acc = balanced_accuracy_score(y_true, y_pred)
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    return f1_macro, bal_acc, cm

def compute_roc_auc(y_true, proba, classes):
    # One-vs-rest ROC per class. Assumes the columns of `proba` follow the order of
    # `classes` and that there are at least three classes (label_binarize returns a
    # single column in the binary case).
    y_bin = label_binarize(y_true, classes=classes)
    fpr = {}; tpr = {}; roc_auc = {}
    for i, c in enumerate(classes):
        fpr[c], tpr[c], _ = roc_curve(y_bin[:, i], proba[:, i])
        roc_auc[c] = auc(fpr[c], tpr[c])
    # Macro average: interpolate every class curve on a common FPR grid, then average
    all_fpr = np.unique(np.concatenate([fpr[c] for c in classes]))
    mean_tpr = np.zeros_like(all_fpr)
    for c in classes:
        mean_tpr += np.interp(all_fpr, fpr[c], tpr[c])
    mean_tpr /= len(classes)
    macro_auc = auc(all_fpr, mean_tpr)
    return roc_auc, macro_auc, fpr, tpr


## 2.2 Models: Baseline and Semi-Supervised
- Baseline: Random Forest trained only on the labeled portion.
- Self-Training: Random Forest base estimator with confidence threshold `τ`.
- Label Spreading/Propagation: `knn` kernel with `k` neighbors (Euclidean distance after scaling).
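
To make the role of `τ` more concrete, the following minimal sketch counts how many unlabeled samples receive a pseudo-label at two thresholds. It runs on synthetic data rather than on the notebook's dataset, and it assumes a `LogisticRegression` base estimator instead of the Random Forest, only to keep it fast. The names `X_toy`, `y_semi_toy`, `unlabeled_mask`, and `pseudo_counts` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic 3-class problem (illustrative, not the notebook's data)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# Hide roughly 80% of the labels; -1 marks an unlabeled sample
rng = np.random.default_rng(0)
unlabeled_mask = rng.random(len(y_toy)) > 0.2
y_semi_toy = y_toy.copy()
y_semi_toy[unlabeled_mask] = -1

pseudo_counts = {}
for tau in (0.6, 0.9):
    # The estimator is passed positionally so the call does not depend on
    # whether the keyword is named `estimator` or `base_estimator`.
    st = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=tau)
    st.fit(X_toy, y_semi_toy)
    # labeled_iter_ is 0 for originally labeled samples, -1 for samples that
    # never crossed the threshold, and >0 for pseudo-labeled samples.
    pseudo_counts[tau] = int((st.labeled_iter_ > 0).sum())

print(pseudo_counts)
```

Comparing the two counts gives a quick feel for how many pseudo-labels each threshold admits on a given dataset before committing to the full grid of `τ` values.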

In [None]:
from copy import deepcopy

def fit_predict_baseline(X_l, y_l, X_te):
    # deepcopy gives each model its own preprocessing steps instead of refitting the shared object
    pipe = Pipeline(steps=[('prep', deepcopy(preprocess)), ('clf', make_rf())])
    pipe.fit(X_l, y_l)
    y_pred = pipe.predict(X_te)
    proba = pipe.predict_proba(X_te) if hasattr(pipe.named_steps['clf'], 'predict_proba') else None
    return y_pred, proba, pipe

def fit_predict_self_training(X_tr, y_tr_with_unlabeled, X_te, threshold):
    base = make_rf()
    # Pass the estimator positionally: the keyword was renamed from `base_estimator`
    # to `estimator` in recent scikit-learn releases.
    st = SelfTrainingClassifier(base, threshold=threshold, verbose=False)
    pipe = Pipeline(steps=[('prep', deepcopy(preprocess)), ('clf', st)])
    pipe.fit(X_tr, y_tr_with_unlabeled)
    y_pred = pipe.predict(X_te)
    proba = None
    try:
        proba = pipe.predict_proba(X_te)
    except Exception:
        pass
    return y_pred, proba, pipe

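def fit_predict_label_graph(X_tr, y_tr_with_unlabeled, X_te, method='spreading', k=10):
    if method == 'spreading':
        lg = LabelSpreading(kernel='knn', n_neighbors=int(k), alpha=0.2)
    else:
        lg = LabelPropagation(kernel='knn', n_neighbors=int(k))
    # The graph models need one homogeneous label array, so the string classes are
    # encoded as integer codes and -1 is kept as the unlabeled marker.
    y_raw = np.asarray(y_tr_with_unlabeled, dtype=object)
    is_labeled = np.array([v != -1 for v in y_raw], dtype=bool)
    seen = np.unique(y_raw[is_labeled].astype(str))
    y_enc = np.full(len(y_raw), -1, dtype=int)
    y_enc[is_labeled] = np.searchsorted(seen, y_raw[is_labeled].astype(str))
    scaler = Pipeline([('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())])
    X_tr_s = scaler.fit_transform(X_tr)
    lg.fit(X_tr_s, y_enc)
    X_te_s = scaler.transform(X_te)
    y_pred = seen[lg.predict(X_te_s)]  # map integer codes back to the original labels
    proba = None
    try:
        proba = lg.predict_proba(X_te_s)
    except Exception:
        pass
    return y_pred, proba, lg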


## 2.3 Validation protocol and runs
Train/Test split (20% test). For each `p` and repetition, a stratified labeled sample is drawn within Train; the remaining Train rows are treated as unlabeled.
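
The label-masking step of this protocol can be sketched in isolation as follows. This is a simplified illustration on integer-coded toy labels, not the notebook's exact code; `mask_labels` is a hypothetical helper name introduced here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def mask_labels(y, p, seed=0):
    """Keep the labels of a stratified fraction p of the samples; set the rest to -1.

    Hypothetical helper for illustration; assumes integer-coded labels.
    """
    y = np.asarray(y)
    idx = np.arange(len(y))
    # A stratified split of the indices selects the rows that stay labeled
    labeled_idx, _ = train_test_split(idx, train_size=p, stratify=y, random_state=seed)
    y_semi = np.full(len(y), -1, dtype=int)
    y_semi[labeled_idx] = y[labeled_idx]
    return y_semi

# Toy labels with three imbalanced classes
y_toy_labels = np.array([0] * 50 + [1] * 30 + [2] * 20)
y_masked = mask_labels(y_toy_labels, p=0.2, seed=42)

kept = y_masked != -1
print('labeled samples:', int(kept.sum()))
print('labeled per class:', np.bincount(y_masked[kept]))
```

Because the split is stratified, the labeled subset keeps roughly the class proportions of Train, which matters most at small `p`, where a rare class could otherwise end up with no labeled examples.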

In [None]:
# Base Train/Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE)

ratios = [0.05, 0.10, 0.20, 0.40, 0.60, 0.80]
repeats = 10
taus = [0.6, 0.7, 0.8, 0.9]
knns = [5, 10, 20]

results = []

for p in ratios:
    for rep in range(repeats):
        rs = RANDOM_STATE + rep + int(p*1000)
        X_tr, y_tr = X_train.copy(), y_train.copy()
        X_l, _, y_l, _ = train_test_split(X_tr, y_tr, train_size=p, stratify=y_tr, random_state=rs)
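        # Object dtype lets the Series hold string labels alongside the -1 "unlabeled" marker
        y_tr_semi = pd.Series(-1, index=X_tr.index, dtype=object)
        y_tr_semi.loc[y_l.index] = y_l.values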
        # Baseline
        yb_pred, yb_proba, _ = fit_predict_baseline(X_l, y_l, X_test)
        b_f1, b_bal, _ = compute_metrics(y_test, yb_pred, classes_)
        b_auc_macro = None
        if yb_proba is not None:
            _, b_auc_macro, _, _ = compute_roc_auc(y_test, yb_proba, classes_)
        results.append({'p': p, 'rep': rep, 'method': 'baseline', 'param': None, 'f1_macro': b_f1, 'bal_acc': b_bal, 'auc_macro': b_auc_macro})
        # Self-Training
        for tau in taus:
            ys_pred, ys_proba, _ = fit_predict_self_training(X_tr, y_tr_semi.values, X_test, threshold=tau)
            s_f1, s_bal, _ = compute_metrics(y_test, ys_pred, classes_)
            s_auc_macro = None
            if ys_proba is not None:
                _, s_auc_macro, _, _ = compute_roc_auc(y_test, ys_proba, classes_)
            results.append({'p': p, 'rep': rep, 'method': 'self_training', 'param': {'tau': tau}, 'f1_macro': s_f1, 'bal_acc': s_bal, 'auc_macro': s_auc_macro, 'delta_f1': s_f1 - b_f1, 'delta_bal': s_bal - b_bal})
        # Label Spreading/Propagation
        for k in knns:
            for meth in ['spreading', 'propagation']:
                yl_pred, yl_proba, _ = fit_predict_label_graph(X_tr, y_tr_semi.values, X_test, method=meth, k=k)
                l_f1, l_bal, _ = compute_metrics(y_test, yl_pred, classes_)
                l_auc_macro = None
                if yl_proba is not None:
                    _, l_auc_macro, _, _ = compute_roc_auc(y_test, yl_proba, classes_)
                results.append({'p': p, 'rep': rep, 'method': f'label_{meth}', 'param': {'k': k}, 'f1_macro': l_f1, 'bal_acc': l_bal, 'auc_macro': l_auc_macro, 'delta_f1': l_f1 - b_f1, 'delta_bal': l_bal - b_bal})

results_df = pd.DataFrame(results)
results_df.head()


## 2.3 Evaluation metrics and comparison vs baseline
- Macro F1, Balanced Accuracy, ROC-AUC (macro).
- Gain vs baseline: ΔMacro-F1 and ΔBalanced-Acc.
- Confusion matrix per `p` (mean over repetitions).
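
As a small worked example of why these metrics are used instead of plain accuracy, consider a toy imbalanced case in which a classifier always predicts the majority class. The arrays below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# 8 samples of class 0, 2 samples of class 1; the classifier always predicts 0
y_true_toy = np.array([0] * 8 + [1] * 2)
y_pred_toy = np.zeros(10, dtype=int)

acc = accuracy_score(y_true_toy, y_pred_toy)
# Balanced accuracy is the mean of per-class recall: (1.0 + 0.0) / 2
bal_acc = balanced_accuracy_score(y_true_toy, y_pred_toy)
# Macro F1 is the unweighted mean of per-class F1: (8/9 + 0) / 2
macro_f1 = f1_score(y_true_toy, y_pred_toy, average='macro', zero_division=0)

print(f'accuracy={acc:.3f}  balanced_acc={bal_acc:.3f}  macro_f1={macro_f1:.3f}')
```

Accuracy looks acceptable here, while Balanced Accuracy and Macro F1 both expose that the minority class is never predicted.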

In [None]:
# Summary by method and p (averaged over repetitions and hyperparameter settings)
agg = results_df.groupby(['method','p']).agg(
    f1_macro_mean=('f1_macro','mean'), f1_macro_std=('f1_macro','std'),
    bal_acc_mean=('bal_acc','mean'), bal_acc_std=('bal_acc','std'),
    auc_macro_mean=('auc_macro','mean')
).reset_index()
agg.sort_values(['method','p'])


In [None]:
# Performance curves vs labeled ratio p
plt.figure(figsize=(10,6))
for m in agg['method'].unique():
    sub = agg[agg['method']==m]
    plt.plot(sub['p'], sub['f1_macro_mean'], marker='o', label=f'{m} - F1 macro')
plt.xlabel('ratio_labeled p')
plt.ylabel('F1 macro (mean)')
plt.title('Macro F1 vs p')
plt.legend()
plt.show()

plt.figure(figsize=(10,6))
for m in agg['method'].unique():
    sub = agg[agg['method']==m]
    plt.plot(sub['p'], sub['bal_acc_mean'], marker='o', label=f'{m} - Balanced Acc')
plt.xlabel('ratio_labeled p')
plt.ylabel('Balanced Accuracy (mean)')
plt.title('Balanced Accuracy vs p')
plt.legend()
plt.show()


In [None]:
# Effect of τ in Self-Training and of k in the label-graph methods
st_sub = results_df[results_df['method']=='self_training']
if not st_sub.empty:
    st_sub = st_sub.copy()
    st_sub['tau'] = st_sub['param'].apply(lambda d: d.get('tau') if isinstance(d, dict) else None)
    plt.figure(figsize=(10,6))
    for tau in sorted([t for t in st_sub['tau'].dropna().unique()]):
        tmp = st_sub[st_sub['tau']==tau].groupby('p').agg(f1=('f1_macro','mean')).reset_index()
        plt.plot(tmp['p'], tmp['f1'], marker='o', label=f'tau={tau}')
    plt.title('Self-Training: macro F1 vs p by τ')
    plt.xlabel('p')
    plt.ylabel('F1 macro (mean)')
    plt.legend()
    plt.show()

for meth in ['label_spreading','label_propagation']:
    lg_sub = results_df[results_df['method']==meth]
    if lg_sub.empty:
        continue
    lg_sub = lg_sub.copy()
    lg_sub['k'] = lg_sub['param'].apply(lambda d: d.get('k') if isinstance(d, dict) else None)
    plt.figure(figsize=(10,6))
    for k in sorted([int(v) for v in lg_sub['k'].dropna().unique()]):
        tmp = lg_sub[lg_sub['k']==k].groupby('p').agg(f1=('f1_macro','mean')).reset_index()
        plt.plot(tmp['p'], tmp['f1'], marker='o', label=f'k={k}')
    plt.title(f'{meth}: macro F1 vs p by k')
    plt.xlabel('p')
    plt.ylabel('F1 macro (mean)')
    plt.legend()
    plt.show()


In [None]:
# Gain vs baseline (Δ) for each method
base = results_df[results_df['method']=='baseline'][['p','rep','f1_macro','bal_acc']].rename(columns={'f1_macro':'b_f1','bal_acc':'b_bal'})
merged = results_df.merge(base, on=['p','rep'], how='left')
merged['delta_f1_calc'] = merged['f1_macro'] - merged['b_f1']
merged['delta_bal_calc'] = merged['bal_acc'] - merged['b_bal']
gain = merged[merged['method']!='baseline'].groupby(['method','p']).agg(
    dF1_mean=('delta_f1_calc','mean'), dF1_std=('delta_f1_calc','std'),
    dBal_mean=('delta_bal_calc','mean'), dBal_std=('delta_bal_calc','std')
).reset_index()
gain.sort_values(['method','p'])


In [None]:
# Statistical test vs baseline (paired t-test per p)
from scipy.stats import ttest_rel

tests = []
for p in ratios:
    sub = results_df[results_df['p']==p]
    base_scores = sub[sub['method']=='baseline'].set_index('rep')['f1_macro']
    for meth in sub['method'].unique():
        if meth=='baseline':
            continue
        # Each method has several rows per repetition (one per τ or k). Average them so
        # every repetition yields one score, then pair with the baseline by repetition.
        meth_scores = sub[sub['method']==meth].groupby('rep')['f1_macro'].mean()
        common = base_scores.index.intersection(meth_scores.index)
        if len(common)>1:
            t, pval = ttest_rel(meth_scores.loc[common].values, base_scores.loc[common].values)
            tests.append({'p': p, 'method': meth, 't_stat': t, 'p_value': pval})

tests_df = pd.DataFrame(tests, columns=['p','method','t_stat','p_value']).sort_values(['method','p'])
tests_df

In [None]:
# Mean confusion matrices by p and method (example visualization)
cm_summary = []
for p in ratios:
    for meth in results_df['method'].unique():
        cms = []
        for rep in range(repeats):
            rs = RANDOM_STATE + rep + int(p*1000)
            X_tr, y_tr = X_train.copy(), y_train.copy()
            X_l, _, y_l, _ = train_test_split(X_tr, y_tr, train_size=p, stratify=y_tr, random_state=rs)
            y_tr_semi = pd.Series(-1, index=X_tr.index, dtype=object)
            y_tr_semi.loc[y_l.index] = y_l.values
            if meth=='baseline':
                y_pred, _, _ = fit_predict_baseline(X_l, y_l, X_test)
            elif meth=='self_training':
                tau = np.median(taus)
                y_pred, _, _ = fit_predict_self_training(X_tr, y_tr_semi.values, X_test, threshold=float(tau))
            elif meth in ['label_spreading','label_propagation']:
                k = 10
                y_pred, _, _ = fit_predict_label_graph(X_tr, y_tr_semi.values, X_test, method=meth.split('_')[1], k=int(k))
            else:
                continue
            cm = confusion_matrix(y_test, y_pred, labels=classes_)
            cms.append(cm)
        if cms:
            cm_mean = np.mean(cms, axis=0)
            cm_summary.append({'p': p, 'method': meth, 'cm_mean': cm_mean})

# Show one example
if cm_summary:
    ex = cm_summary[0]
    plt.figure(figsize=(6,5))
    sns.heatmap(ex['cm_mean'], annot=True, fmt='.1f', xticklabels=classes_, yticklabels=classes_, cmap='Blues')
    plt.title(f"Mean CM - method={ex['method']} p={ex['p']}")
    plt.ylabel('True')
    plt.xlabel('Pred')
    plt.tight_layout()
    plt.show()

## ROC and AUC (example run)
Per-class ROC curves, with the macro AUC in the title, are plotted for the baseline and for a representative semi-supervised method (the one with the best mean macro F1).

In [None]:
semi_methods = results_df[results_df['method']!='baseline'].groupby('method').agg(mean_f1=('f1_macro','mean')).sort_values('mean_f1', ascending=False)
best_method = semi_methods.index[0] if len(semi_methods)>0 else None
best_method


In [None]:
# ROC curves for the last run of the best method
if best_method is not None:
    last = results_df[results_df['method']==best_method].iloc[-1]
    p = last['p']; rep = int(last['rep'])
    rs = RANDOM_STATE + rep + int(p*1000)
    X_tr, y_tr = X_train.copy(), y_train.copy()
    X_l, _, y_l, _ = train_test_split(X_tr, y_tr, train_size=p, stratify=y_tr, random_state=rs)
    y_tr_semi = pd.Series(-1, index=X_tr.index, dtype=object); y_tr_semi.loc[y_l.index] = y_l.values
    _, b_proba, _ = fit_predict_baseline(X_l, y_l, X_test)
    proba_semi = None
    if best_method=='self_training':
        tau = results_df[(results_df['method']=='self_training') & (results_df['p']==p)]['param'].iloc[-1]['tau']
        _, proba_semi, _ = fit_predict_self_training(X_tr, y_tr_semi.values, X_test, threshold=tau)
    elif best_method in ['label_spreading','label_propagation']:
        k = results_df[(results_df['method']==best_method) & (results_df['p']==p)]['param'].iloc[-1]['k']
        _, proba_semi, _ = fit_predict_label_graph(X_tr, y_tr_semi.values, X_test, method=best_method.split('_')[1], k=k)
    if b_proba is not None:
        _, b_auc_macro, fpr_b, tpr_b = compute_roc_auc(y_test, b_proba, classes_)
        plt.figure(figsize=(7,6))
        plt.plot([0,1],[0,1],'k--',alpha=0.3)
        for c in classes_:
            plt.plot(fpr_b[c], tpr_b[c], alpha=0.2)
        plt.title(f'Baseline macro-AUC={b_auc_macro:.3f}')
        plt.show()
    if proba_semi is not None:
        _, s_auc_macro, fpr_s, tpr_s = compute_roc_auc(y_test, proba_semi, classes_)
        plt.figure(figsize=(7,6))
        plt.plot([0,1],[0,1],'k--',alpha=0.3)
        for c in classes_:
            plt.plot(fpr_s[c], tpr_s[c], alpha=0.2)
        plt.title(f'{best_method} macro-AUC={s_auc_macro:.3f}')
        plt.show()


## Interpretability: Feature importance (RF) and t-SNE colored by label

In [None]:
pipe_all = Pipeline(steps=[('prep', deepcopy(preprocess)), ('clf', make_rf())])
pipe_all.fit(X_train, y_train)
rf = pipe_all.named_steps['clf']
try:
    importances = rf.feature_importances_
    # Use the feature names that survive preprocessing, so importances stay aligned
    # with their columns even if the variance filter drops some of them
    kept_features = pipe_all.named_steps['prep'].get_feature_names_out()
    imp_df = pd.DataFrame({'feature': kept_features, 'importance': importances})
    imp_df = imp_df.sort_values('importance', ascending=False).head(20)
    plt.figure(figsize=(8,6))
    sns.barplot(data=imp_df, x='importance', y='feature')
    plt.title('Top-20 feature importances (RF)')
    plt.tight_layout()
    plt.show()
except Exception as e:
    print('Could not compute feature importances:', e)


## 2.4 Analysis and report
- Performance curves vs `p` for each method.
- Statistical test (paired t-test vs baseline per `p`).
- Effect of `τ` and `k`.
- Discussion of the most frequent errors per class.
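
For the per-class error discussion, one possible starting point is to rank the off-diagonal cells of a confusion matrix. The sketch below uses a made-up 3x3 matrix and a hypothetical helper, `top_confusions`, which is not part of the notebook; in practice it could be fed one of the mean confusion matrices together with the class labels.

```python
import numpy as np
import pandas as pd

def top_confusions(cm, labels, n=5):
    """List the most frequent (true, predicted) error pairs of a confusion matrix.

    Hypothetical helper for illustration; rows of `cm` are true classes.
    """
    cm = np.asarray(cm, dtype=float)
    off = cm.copy()
    np.fill_diagonal(off, 0)  # keep only the misclassifications
    rows, cols = np.nonzero(off)
    out = pd.DataFrame({
        'true': [labels[r] for r in rows],
        'pred': [labels[c] for c in cols],
        'count': off[rows, cols],
    })
    # Share of each true class that ends up in this wrong cell
    out['share_of_true'] = out['count'] / cm.sum(axis=1)[rows]
    return out.sort_values('count', ascending=False).head(n).reset_index(drop=True)

# Made-up confusion matrix for three classes
cm_toy = [[5, 2, 0],
          [1, 6, 3],
          [0, 0, 4]]
errors = top_confusions(cm_toy, labels=['A', 'B', 'C'], n=3)
print(errors)
```

Normalizing by the row total separates errors that are frequent only because a class is large from errors that affect a large share of a small class.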

### Notes on hyperparameters and their effect
- `ratio_labeled (p)`:
  - ↑p: more direct supervision; greater stability.
  - ↓p: greater reliance on the semi-supervised component; more sensitivity to noise.
- `τ` in Self-Training:
  - ↑τ: accepts fewer pseudo-labels (higher precision, lower recall).
  - ↓τ: accepts more pseudo-labels (higher recall, risk of noise).
- `k` in Label Spreading/Propagation:
  - ↑k: denser graph, more smoothing; risk of over-propagation.
  - ↓k: sparser graph; can fragment classes.
- `alpha` (Label Spreading):
  - ↑alpha: more weight on the neighbors.
  - ↓alpha: more weight on the initial labels.
- Random Forest:
  - ↑n_estimators: lower variance, higher cost.
  - ↓max_depth: more bias; ↑max_depth: less bias and possible overfitting.
- Preprocessing: median imputation + scaling + variance filter.
  - Scaling is critical for k-NN and t-SNE.
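
The last point, on scaling, can be illustrated with a tiny made-up example in which one feature is a ratio in [0, 1] and the other is measured in thousands. The feature meanings and values are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Columns: a ratio in [0, 1] and an amount in thousands (made-up values)
X_demo = np.array([
    [0.10, 1000.0],  # A: the query point
    [0.12, 1300.0],  # B: close to A on both features
    [0.90, 1100.0],  # C: close to A only on the large-scale feature
    [0.88, 5000.0],  # D
    [0.50, 3000.0],  # E
])

def nearest_other(X, i=0):
    """Index of the nearest neighbor of row i, excluding the row itself."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    _, idx = nn.kneighbors(X[i:i + 1])
    return int(idx[0][1])  # idx[0][0] is the query row itself (distance 0)

raw_nn = nearest_other(X_demo)
scaled_nn = nearest_other(StandardScaler().fit_transform(X_demo))
print('nearest to A, raw features:   ', raw_nn)
print('nearest to A, scaled features:', scaled_nn)
```

On raw features the large-scale column dominates the Euclidean distance, so A is matched with C; after standardization both columns contribute and A is matched with B. The same effect changes which edges a `knn` kernel builds in Label Spreading/Propagation.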