# Practical 4: Semi-Supervised Classification

**Objective**: Introduce the concepts of semi-supervised classification.

The practical consists of two tasks:

## TASK 1: Implementing a method
- Select one of the algorithms covered in the lectures that implements semi-supervised learning under any of the paradigms studied.
- Implement the algorithm.
- Select at least one semi-supervised dataset and evaluate the implemented algorithm on it.

## TASK 2: Comparing methods
1. Select at least two algorithms from those available in the indicated libraries.
2. Select at least three semi-supervised problems from the indicated repositories.
3. Apply the selected algorithms to the datasets.
4. Compare the results and explain the conclusions drawn.
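
When a chosen dataset comes fully labeled, one common way to turn it into a semi-supervised problem is to hide part of the labels. The sketch below assumes scikit-learn's convention of marking unlabeled samples with `-1`. The helper `mask_labels` is hypothetical, and `load_wine` is only a stand-in for a dataset from the indicated repositories.

```python
import numpy as np
from sklearn.datasets import load_wine


def mask_labels(y, unlabeled_fraction, seed=0):
    """Return a copy of y with a random fraction of entries replaced by -1.

    Hypothetical helper for illustration; -1 is the marker scikit-learn's
    semi-supervised estimators use for unlabeled samples.
    """
    rng = np.random.default_rng(seed)
    y_semi = np.array(y, copy=True)
    n_hidden = int(unlabeled_fraction * len(y_semi))
    hidden = rng.choice(len(y_semi), size=n_hidden, replace=False)
    y_semi[hidden] = -1
    return y_semi


# Stand-in dataset (assumption): any labeled dataset can be used here
X, y = load_wine(return_X_y=True)
y_semi = mask_labels(y, unlabeled_fraction=0.8, seed=42)

n_unlabeled = int(np.sum(y_semi == -1))
print(f"{n_unlabeled} of {len(y)} samples are unlabeled")
```

Keeping the original `y` around makes it possible to score the pseudo-labels later, which is not possible with datasets that are unlabeled from the start.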

This notebook includes a worked example of both tasks:
1. **Task 1**: Implementation and use of a Self-Training method.
2. **Task 2**: Comparison of Label Spreading and Self-Training on synthetically generated datasets.


## TASK 1: Implementing a Self-Training method

This cell implements a *Self-Training* algorithm from scratch and evaluates it on a synthetic dataset. A model is trained on a small labeled subset, and the procedure then iterates over the unlabeled instances, progressively pseudo-labeling those for which the model is confident enough (predicted probability at or above a confidence threshold).

In [None]:
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier

def self_training(X_labeled, y_labeled, X_unlabeled, base_classifier,
                  confidence_threshold=0.9, max_iter=10):
    """
    Self-Training method.

    Parameters
    ----------
    X_labeled : array-like, shape (n_labeled_samples, n_features)
        Labeled dataset.
    y_labeled : array-like, shape (n_labeled_samples,)
        Corresponding labels for X_labeled.
    X_unlabeled : array-like, shape (n_unlabeled_samples, n_features)
        Unlabeled dataset.
    base_classifier : scikit-learn estimator
        Base classifier. If it implements `predict_proba`, the highest class
        probability is used as the confidence; otherwise confidence is assumed
        to be 1 for all instances.
    confidence_threshold : float, optional (default=0.9)
        Minimum probability threshold to accept pseudo-labeled instances.
    max_iter : int, optional (default=10)
        Maximum number of self-training iterations.

    Returns
    -------
    trained_classifier : scikit-learn estimator
        The trained classifier at the end of the process (or after exhausting
        the unlabeled instances that meet the confidence threshold).
    """

    # Convert to NumPy arrays; np.array copies, so the caller's data is not modified
    X_l = np.array(X_labeled)
    y_l = np.array(y_labeled)
    X_u = np.array(X_unlabeled)

    for i in range(1, max_iter + 1):
        # Create/Clone the classifier to train it from scratch in each iteration
        current_clf = clone(base_classifier)
        current_clf.fit(X_l, y_l)

        # If the classifier has predict_proba, use it to calculate confidence
        if hasattr(current_clf, "predict_proba"):
            probs = current_clf.predict_proba(X_u)
            # argmax yields column indices; map them to class labels via classes_
            pred_labels = current_clf.classes_[np.argmax(probs, axis=1)]
            max_probs = np.max(probs, axis=1)
        else:
            # If it does not have predict_proba, use predict and assume confidence=1
            pred_labels = current_clf.predict(X_u)
            max_probs = np.ones(len(X_u))

        # Select instances with confidence >= confidence_threshold
        high_conf_idx = np.where(max_probs >= confidence_threshold)[0]

        if len(high_conf_idx) == 0:
            print(f"Iteration {i}: No instances found with confidence >= {confidence_threshold}. Ending.")
            break

        # Add to the labeled dataset
        X_l = np.vstack([X_l, X_u[high_conf_idx]])
        y_l = np.hstack([y_l, pred_labels[high_conf_idx]])

        # Remove from the unlabeled dataset
        X_u = np.delete(X_u, high_conf_idx, axis=0)

        print(f"Iteration {i}: Added {len(high_conf_idx)} high-confidence instances.")

        # If no more unlabeled examples remain, end
        if len(X_u) == 0:
            print(f"Iteration {i}: No more unlabeled instances remaining. Ending.")
            break

    # Train a final classifier with all accumulated labeled data
    final_clf = clone(base_classifier)
    final_clf.fit(X_l, y_l)

    return final_clf

# Example usage with synthetic data
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Generate synthetic dataset
    X, y = make_classification(n_samples=1000, n_features=10,
                               n_informative=3, n_classes=2,
                               random_state=42)

    # Split into train/test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        test_size=0.3,
                                                        random_state=42)

    # Take only 20% of y_train as labeled
    labeled_percentage = 0.2
    num_labeled = int(len(y_train) * labeled_percentage)
    X_labeled = X_train[:num_labeled]
    y_labeled = y_train[:num_labeled]

    # The rest is considered unlabeled
    X_unlabeled = X_train[num_labeled:]

    # Train using Self-Training with a RandomForest
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    self_trained_model = self_training(
        X_labeled,
        y_labeled,
        X_unlabeled,
        rf,
        confidence_threshold=0.9,
        max_iter=10
    )

    # Evaluate on the test set
    y_pred = self_trained_model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print("Test accuracy after Self-Training:", accuracy)
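
The selection step at the heart of Self-Training can be checked by hand on a tiny example. The sketch below uses made-up probabilities for four unlabeled points and hypothetical string labels `"neg"` and `"pos"`. It shows how the most probable column is mapped back to a class label (the role the `classes_` attribute plays in scikit-learn) and how the threshold filters the candidates.

```python
import numpy as np

# Hypothetical class labels, in the column order of the probability matrix
classes = np.array(["neg", "pos"])

# Made-up predicted probabilities for four unlabeled points
probs = np.array([
    [0.95, 0.05],
    [0.60, 0.40],
    [0.08, 0.92],
    [0.50, 0.50],
])
confidence_threshold = 0.9

# Confidence is the highest class probability of each point
max_probs = probs.max(axis=1)
# argmax gives a column index, which is mapped back to the class label
pred_labels = classes[probs.argmax(axis=1)]
# Keep only the points whose confidence reaches the threshold
selected = np.flatnonzero(max_probs >= confidence_threshold)

print(selected)               # indices of the accepted points
print(pred_labels[selected])  # pseudo-labels assigned to them
```

Only the first and third points pass the 0.9 threshold, so they alone would be moved to the labeled set in this iteration.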


## TASK 2: Comparing Label Spreading and Self-Training

The following cell illustrates how to compare two semi-supervised classification methods (Label Spreading and Self-Training) on two synthetic datasets. For a final submission, more datasets can be added, or these can be replaced with datasets from recognized repositories, as the assignment requires.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
from sklearn.semi_supervised import LabelSpreading, SelfTrainingClassifier
from sklearn.svm import SVC
from sklearn.datasets import make_moons, make_classification

# Generate synthetic semi-supervised datasets
def generate_datasets(seed=42):
    # Seeded generator so the same labels are hidden on every run
    rng = np.random.default_rng(seed)
    datasets = {}

    # Dataset 1: Moons dataset (easy clustering structure)
    X_moons, y_moons = make_moons(n_samples=300, noise=0.2, random_state=42)
    # Remove labels from 70% of the data (-1 marks an unlabeled sample)
    y_moons[rng.choice(len(y_moons), size=int(0.7 * len(y_moons)), replace=False)] = -1
    datasets["Moons"] = (X_moons, y_moons)

    # Dataset 2: More complex synthetic classification dataset
    X_class, y_class = make_classification(n_samples=500, n_features=20, n_informative=15,
                                           n_clusters_per_class=1, random_state=42)
    # Remove labels from 60% of the data
    y_class[rng.choice(len(y_class), size=int(0.6 * len(y_class)), replace=False)] = -1
    datasets["Complex Classification"] = (X_class, y_class)

    return datasets

# Define evaluation function
def evaluate_semi_supervised(model, X, y, dataset_name):
    # Hold out 30% of the labeled samples for testing (their true labels are known)
    labeled_mask = y != -1  # Identify labeled data
    X_lab, X_test, y_lab, y_test = train_test_split(X[labeled_mask], y[labeled_mask],
                                                    test_size=0.3, random_state=42)

    # Train on the remaining labeled samples plus all unlabeled ones (label -1),
    # so that the model actually uses the unlabeled data
    X_train = np.vstack([X_lab, X[~labeled_mask]])
    y_train = np.concatenate([y_lab, y[~labeled_mask]])

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    cm = confusion_matrix(y_test, y_pred)

    print(f"Dataset: {dataset_name}")
    print(f"Accuracy: {acc:.4f}, F1 Score: {f1:.4f}\n")
    return acc, f1, cm

# Load datasets
datasets = generate_datasets()

# Initialize models
label_spreading = LabelSpreading(kernel='knn', n_neighbors=5)
# Named self_training_clf so it does not shadow the self_training function from Task 1.
# The estimator is passed positionally because its keyword name differs between
# scikit-learn versions.
self_training_clf = SelfTrainingClassifier(SVC(probability=True), max_iter=10)

# Evaluate models
results = {}
for dataset_name, (X, y) in datasets.items():
    print(f"Evaluating {dataset_name} with Label Spreading")
    results[(dataset_name, 'Label Spreading')] = evaluate_semi_supervised(label_spreading, X, y, dataset_name)

    print(f"Evaluating {dataset_name} with Self-Training")
    results[(dataset_name, 'Self-Training')] = evaluate_semi_supervised(self_training_clf, X, y, dataset_name)

# Plot accuracy comparison
dataset_names = list(datasets.keys())
methods = ['Label Spreading', 'Self-Training']
accuracies = np.array([[results[(ds, method)][0] for method in methods] for ds in dataset_names])

fig, ax = plt.subplots()
x = np.arange(len(dataset_names))
width = 0.35
ax.bar(x - width/2, accuracies[:, 0], width, label='Label Spreading')
ax.bar(x + width/2, accuracies[:, 1], width, label='Self-Training')
ax.set_ylabel('Accuracy')
ax.set_title('Comparison of Semi-Supervised Learning Methods')
ax.set_xticks(x)
ax.set_xticklabels(dataset_names)
ax.legend()
plt.show()
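
To judge whether the unlabeled data help at all, it can be useful to compare a semi-supervised model against a purely supervised baseline that sees only the labeled samples. The sketch below is one way to set this up, assuming a k-NN baseline (`KNeighborsClassifier`, which is not part of the comparison above) and a test set held out before any labels are hidden.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.semi_supervised import LabelSpreading

X, y = make_moons(n_samples=300, noise=0.2, random_state=42)

# Hold out the test set first, so its labels are never hidden
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Hide 70% of the training labels (-1 marks an unlabeled sample)
rng = np.random.default_rng(42)
y_semi = y_train.copy()
hidden = rng.choice(len(y_semi), size=int(0.7 * len(y_semi)), replace=False)
y_semi[hidden] = -1
labeled_mask = y_semi != -1

# Supervised baseline: trained on the labeled samples only
baseline = KNeighborsClassifier(n_neighbors=5)
baseline.fit(X_train[labeled_mask], y_train[labeled_mask])

# Semi-supervised model: trained on labeled and unlabeled samples together
semi = LabelSpreading(kernel="knn", n_neighbors=5)
semi.fit(X_train, y_semi)

baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
semi_acc = accuracy_score(y_test, semi.predict(X_test))
print(f"Supervised baseline accuracy: {baseline_acc:.3f}")
print(f"Label Spreading accuracy:     {semi_acc:.3f}")
```

If the two accuracies are close, the unlabeled samples are adding little on that dataset, which is itself a result worth reporting in the comparison.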


### Conclusions and Pending Tasks
- This notebook has shown how to implement a **Self-Training** scheme manually and how to compare it with the **Label Spreading** method.
- For a more complete practical, it is recommended to:
  1. Test both algorithms (or more) on *three or more* datasets, preferably taken from public repositories or with differing characteristics.
  2. Analyze the sensitivity to different percentages of unlabeled instances (vary the percentage of labels removed).
  3. Tune hyperparameters (e.g., `confidence_threshold` in the manual Self-Training implementation, `threshold` in `SelfTrainingClassifier`, `n_neighbors` in `LabelSpreading`) to see which configurations give the best results.
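
The sensitivity analysis in point 2 can be sketched as a loop over label-removal percentages. The example below is a minimal sketch under assumed choices: a `LogisticRegression` base estimator, a fixed `threshold=0.9`, and three arbitrary fractions. The test set is held out once, so every fraction is scored on the same samples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=15,
                           n_clusters_per_class=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

rng = np.random.default_rng(42)
accuracy_by_fraction = {}

for unlabeled_fraction in (0.5, 0.7, 0.9):
    # Hide the requested fraction of training labels (-1 marks unlabeled)
    y_semi = y_train.copy()
    hidden = rng.choice(len(y_semi),
                        size=int(unlabeled_fraction * len(y_semi)),
                        replace=False)
    y_semi[hidden] = -1

    # Base estimator passed positionally; threshold is the confidence cut-off
    model = SelfTrainingClassifier(LogisticRegression(max_iter=1000),
                                   threshold=0.9)
    model.fit(X_train, y_semi)
    accuracy_by_fraction[unlabeled_fraction] = accuracy_score(
        y_test, model.predict(X_test))

for fraction, acc in accuracy_by_fraction.items():
    print(f"{fraction:.0%} unlabeled: accuracy = {acc:.3f}")
```

The same loop structure can be reused for point 3 by fixing the unlabeled fraction and iterating over values of `threshold` or `n_neighbors` instead.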
