# **Pseudo-Labeling Model Theory**


## Pseudo-Labeling

---

## Theory
Pseudo-Labeling is a semi-supervised learning technique used to improve the performance of a model when labeled data is scarce. It leverages the model's own predictions to generate pseudo-labels for the unlabeled data. This approach combines labeled and unlabeled data to create a more robust model without the need for additional human labeling.

The main idea is to:
- Train a model on the labeled dataset.
- Use the model to generate pseudo-labels for the unlabeled data.
- Retrain the model using both the labeled data and the newly generated pseudo-labeled data.

---

## Mathematical Foundation
- **Labeled Data**: \( D_{\text{labeled}} = \{(x_i, y_i)\} \), where \( x_i \) represents the feature vector and \( y_i \) the true label.

- **Unlabeled Data**: \( D_{\text{unlabeled}} = \{x_i\} \), where the labels \( y_i \) are unknown.

- **Pseudo-Labels**: For each unlabeled data point \( x_i \), the model assigns as pseudo-label \( \hat{y}_i \) the class with the highest predicted probability:
  $$ \hat{y}_i = \arg\max_{c} \, p_{\text{model}}(y = c \mid x_i) $$

- **Loss Function**:
  The model is trained with a combined loss function for both labeled and pseudo-labeled data:
  $$ L_{\text{total}} = L_{\text{labeled}} + \lambda L_{\text{pseudo}} $$ 
  Where:
  - \( L_{\text{labeled}} \) is the loss for the labeled data.
  - \( L_{\text{pseudo}} \) is the loss for the pseudo-labeled data.
  - \( \lambda \) is a weighting factor that controls the influence of pseudo-labels.
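The definitions above can be made concrete with a small worked example. This is a minimal sketch using only the standard library; it assumes cross-entropy for both loss terms and hard (argmax) pseudo-labels, and the helper names (`softmax`, `pseudo_label`, `total_loss`) and the logits are illustrative, not part of any library.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pseudo_label(probs):
    """Hard pseudo-label: index of the most probable class."""
    return max(range(len(probs)), key=lambda c: probs[c])

def cross_entropy(probs, label):
    """Negative log-probability assigned to the given class."""
    return -math.log(probs[label])

def total_loss(labeled, unlabeled_probs, lam):
    """L_total = L_labeled + lam * L_pseudo, each term a mean cross-entropy.

    labeled: list of (probs, true_label) pairs
    unlabeled_probs: list of probability vectors for unlabeled points
    """
    l_labeled = sum(cross_entropy(p, y) for p, y in labeled) / len(labeled)
    l_pseudo = sum(
        cross_entropy(p, pseudo_label(p)) for p in unlabeled_probs
    ) / len(unlabeled_probs)
    return l_labeled + lam * l_pseudo

# Hypothetical model outputs for two labeled and two unlabeled points
labeled = [(softmax([2.0, 0.5, 0.1]), 0), (softmax([0.2, 1.5, 0.3]), 1)]
unlabeled = [softmax([0.1, 0.2, 3.0]), softmax([1.0, 0.9, 0.8])]

loss_without = total_loss(labeled, unlabeled, lam=0.0)
loss_with = total_loss(labeled, unlabeled, lam=0.5)
print(f"lambda=0.0: {loss_without:.4f}  lambda=0.5: {loss_with:.4f}")
```

The second unlabeled point has nearly uniform probabilities, so its pseudo-label is unreliable and its loss term is large. This is the situation a confidence threshold or a small \( \lambda \) is meant to guard against.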

---

## Algorithm Steps
1. **Train on Labeled Data**:
   - Train the model using the labeled dataset \( D_{\text{labeled}} \).

2. **Generate Pseudo-Labels**:
   - Use the trained model to predict labels for the unlabeled data \( D_{\text{unlabeled}} \).

3. **Filter Pseudo-Labels**:
   - Optionally, select only confident predictions (e.g., predictions above a certain threshold).

4. **Retrain the Model**:
   - Combine the labeled data and pseudo-labeled data and retrain the model using the combined dataset.

5. **Evaluate the Model**:
   - Evaluate the model's performance on a validation set or test set.
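The five steps can be sketched end to end on toy data. The example below is an illustration only: it swaps in a one-dimensional nearest-centroid classifier for the neural network so that it runs with the standard library alone, and the confidence score (a distance ratio) and all function names are assumptions made for this sketch.

```python
def fit_centroids(xs, ys):
    """'Train' by computing the mean feature value of each class."""
    centroids = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = sum(members) / len(members)
    return centroids

def predict_with_confidence(centroids, x):
    """Return (label, confidence); confidence lies in [0.5, 1] for two classes."""
    dists = sorted((abs(x - c), label) for label, c in centroids.items())
    (d1, label), (d2, _) = dists[0], dists[1]
    confidence = d2 / (d1 + d2) if d1 + d2 > 0 else 1.0
    return label, confidence

def pseudo_label_round(x_lab, y_lab, x_unlab, threshold=0.8):
    centroids = fit_centroids(x_lab, y_lab)                           # step 1
    preds = [predict_with_confidence(centroids, x) for x in x_unlab]  # step 2
    kept = [(x, lbl) for x, (lbl, conf) in zip(x_unlab, preds)
            if conf >= threshold]                                     # step 3
    x_all = x_lab + [x for x, _ in kept]
    y_all = y_lab + [lbl for _, lbl in kept]
    return fit_centroids(x_all, y_all), kept                          # step 4

x_lab, y_lab = [0.0, 1.0, 9.0, 10.0], [0, 0, 1, 1]
x_unlab = [0.5, 2.0, 4.9, 8.0, 9.5]
model, kept = pseudo_label_round(x_lab, y_lab, x_unlab)

# Step 5: evaluate on held-out labeled points
x_val, y_val = [1.5, 8.5], [0, 1]
val_preds = [predict_with_confidence(model, x)[0] for x in x_val]
val_accuracy = sum(p == y for p, y in zip(val_preds, y_val)) / len(y_val)
print(kept, model, val_accuracy)
```

The point at 4.9 sits almost exactly between the two class centroids, so its confidence is close to 0.5 and the filter in step 3 discards it.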

---

## Key Parameters
- **Threshold**: Confidence threshold for accepting pseudo-labels.
- **Weighting Factor \( \lambda \)**: Balances the contribution of labeled and pseudo-labeled data.
- **Unlabeled Data**: Amount of unlabeled data available for pseudo-labeling.

---

## Advantages
- **Reduces Labeling Cost**: Leverages unlabeled data, which is often abundant.
- **Improves Model Performance**: Can improve accuracy by training on unlabeled data that would otherwise go unused, provided most pseudo-labels are correct.
- **Scalable**: Effective for large datasets where labeling is expensive or time-consuming.

---

## Disadvantages
- **Model Bias (Confirmation Bias)**: Incorrect pseudo-labels add label noise, and retraining on them reinforces the model's own mistakes.
- **Confidence Threshold Tuning**: Choosing the right threshold for pseudo-label selection is crucial.
- **Risk of Overfitting**: Over-relying on pseudo-labeled data may lead to overfitting on noisy labels.

---

## Implementation Tips
- **Threshold Selection**: Tune the confidence threshold on a held-out labeled validation set (for example, with a grid search), trading pseudo-label accuracy against the number of samples kept.
- **Iterative Refinement**: Apply pseudo-labeling iteratively, refining the model over time.
- **Regularization**: Use regularization techniques to prevent overfitting, especially when using noisy pseudo-labels.
- **Use a Robust Base Model**: A well-trained model is crucial for generating high-quality pseudo-labels.
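One way to approach threshold selection is to sweep candidate thresholds over a labeled validation set and look at how many predictions survive (coverage) and how many of those are correct. The sketch below assumes the model's confidence, prediction, and the true label are available for each validation sample; the records are made-up values for illustration.

```python
def sweep_thresholds(records, thresholds):
    """records: list of (confidence, predicted_label, true_label) tuples.

    Returns one (threshold, coverage, accuracy_of_kept) row per threshold.
    """
    rows = []
    for t in thresholds:
        kept = [(p, y) for conf, p, y in records if conf >= t]
        coverage = len(kept) / len(records)
        accuracy = (
            sum(p == y for p, y in kept) / len(kept) if kept else float("nan")
        )
        rows.append((t, coverage, accuracy))
    return rows

# Hypothetical validation predictions
records = [
    (0.99, 1, 1), (0.95, 0, 0), (0.90, 1, 1), (0.80, 0, 1),
    (0.70, 1, 1), (0.60, 0, 1), (0.55, 1, 0), (0.51, 0, 0),
]
rows = sweep_thresholds(records, [0.5, 0.75, 0.9])
for t, coverage, accuracy in rows:
    print(f"threshold={t:.2f}  coverage={coverage:.3f}  accuracy={accuracy:.3f}")
```

Raising the threshold here increases the accuracy of the kept pseudo-labels while shrinking the number of samples available for retraining, which is the trade-off the threshold controls.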

---

## Applications
- **Image Classification**: Labeling large amounts of unlabeled image data.
- **Natural Language Processing**: Generating pseudo-labels for text classification tasks.
- **Medical Imaging**: Annotating medical images where labeled data is limited.
- **Speech Recognition**: Using unlabeled audio data to improve speech recognition models.

Pseudo-labeling is a simple but effective technique to leverage unlabeled data and improve model performance when labeled data is scarce. It is especially useful in real-world scenarios where obtaining labeled data can be expensive or time-consuming.


## **Model Evaluation for Pseudo-Labeling**

### 1. Accuracy

**Formula:**
$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$

**Description:**
- Measures the overall correctness of the pseudo-labeling process.
- Compares the number of correctly pseudo-labeled samples to the total pseudo-labeled samples. This requires ground-truth labels, so it is computed on a held-out labeled subset.

**Interpretation:**
- Higher accuracy indicates better pseudo-labeling quality.
- Can be misleading if the dataset is imbalanced.
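A short example shows why accuracy can mislead on imbalanced data. The counts below are hypothetical: 95 negative and 5 positive samples, with a model that pseudo-labels everything as negative.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy from confusion counts, as in the formula above."""
    return (tp + tn) / (tp + tn + fp + fn)

# All 100 samples are pseudo-labeled negative: no positives are ever found
tp, tn, fp, fn = 0, 95, 0, 5
acc = accuracy(tp, tn, fp, fn)
print(f"accuracy = {acc:.2f}, positives found = {tp} of {tp + fn}")
```

The score looks high even though every positive sample received the wrong pseudo-label.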

---

### 2. Precision

**Formula:**
$$
\text{Precision} = \frac{TP}{TP + FP}
$$

**Description:**
- Measures the proportion of correctly pseudo-labeled positive samples out of all predicted positives.
- Helps assess the reliability of pseudo-labels generated by the model.

**Interpretation:**
- High precision means fewer false positives in the pseudo-labeling process.
- Important when incorrect pseudo-labels can harm the model.

---

### 3. Recall (Sensitivity)

**Formula:**
$$
\text{Recall} = \frac{TP}{TP + FN}
$$

**Description:**
- Measures how many actual positive samples were correctly pseudo-labeled.
- Important when missing pseudo-labels for positives is costly.

**Interpretation:**
- High recall indicates fewer missed positive samples.
- Crucial for ensuring the model captures as many relevant pseudo-labels as possible.

---

### 4. F1-Score

**Formula:**
$$
F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$

**Description:**
- A balance between precision and recall in pseudo-labeling.
- Useful when evaluating the trade-off between the two.

**Interpretation:**
- Higher F1-score indicates a balanced performance.
- Important when both false positives and false negatives are problematic.
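Precision, recall, and F1 can be computed directly from lists of true labels and pseudo-labels. This is a minimal sketch for the binary case; the function name and the convention of returning 0.0 when a denominator is zero are assumptions of this example.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Binary precision, recall, and F1 for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall else 0.0
    )
    return precision, recall, f1

# Hypothetical true labels and pseudo-labels
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of three predicted positives are correct (precision 2/3) but only two of four actual positives are found (recall 1/2), and F1 falls between the two.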

---

### 5. Confusion Matrix

**Description:**
- A table summarizing true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) for pseudo-labeled data.

**Interpretation:**
- Helps visualize errors in pseudo-labeling.
- Requires true labels for the pseudo-labeled samples, so it is usually built on a held-out labeled subset.
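For more than two classes the same idea extends to an \( n \times n \) table. The sketch below builds one with rows as true classes and columns as pseudo-labels; that orientation and the sample labels are choices made for this example.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """matrix[t][p] counts samples of true class t that were labeled p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
for row in cm:
    print(row)
```

Diagonal entries are correct pseudo-labels; off-diagonal entries show which class pairs are being confused.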

---

### 6. AUC-ROC Curve

**Description:**
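- Plots True Positive Rate (TPR) against False Positive Rate (FPR) as the decision threshold varies, using the model's confidence scores on samples with known labels.
- AUC (Area Under the Curve) summarizes how well those scores rank positive samples above negative ones.

**Interpretation:**
- **AUC = 1** → Perfect ranking of positives above negatives.
- **AUC > 0.8** → Generally considered strong, although the cutoff is a rule of thumb.
- **AUC = 0.5** → No better than random.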
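AUC has an equivalent pairwise reading: it is the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as half. The sketch below computes it that way; it is quadratic in the number of samples and meant only to illustrate the definition, and the scores are made-up values.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(y_true, scores)
print(f"AUC = {auc:.3f}")
```

Eight of the nine positive/negative pairs are ordered correctly here; the only miss is the positive scored 0.4 against the negative scored 0.7.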

---

### 7. Labeling Accuracy (Pseudo-Label Quality)

**Formula:**
$$
\text{Labeling Accuracy} = \frac{\text{Correctly Pseudo-Labeled Samples}}{\text{Total Pseudo-Labeled Samples}}
$$

**Description:**
- Measures how accurately the model assigns pseudo-labels to unlabeled data.

**Interpretation:**
- Higher accuracy means better pseudo-labeling quality.
- Poor labeling can degrade the overall model performance.
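When rejected predictions are marked with a sentinel value such as `-1`, labeling accuracy should be computed over the accepted pseudo-labels only, and it is useful to report coverage alongside it. The sketch below assumes that sentinel convention and that true labels are available for checking; the label values are illustrative.

```python
def labeling_accuracy(pseudo_labels, true_labels, rejected=-1):
    """Return (accuracy over accepted pseudo-labels, fraction accepted)."""
    kept = [(p, t) for p, t in zip(pseudo_labels, true_labels) if p != rejected]
    if not kept:
        return float("nan"), 0.0
    acc = sum(p == t for p, t in kept) / len(kept)
    coverage = len(kept) / len(pseudo_labels)
    return acc, coverage

pseudo = [3, -1, 7, 7, -1, 1]
truth = [3, 5, 7, 1, 2, 1]
label_acc, coverage = labeling_accuracy(pseudo, truth)
print(f"labeling accuracy = {label_acc:.2f}, coverage = {coverage:.2f}")
```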

---

### 8. Number of Iterations Until Convergence

**Description:**
- Pseudo-labeling often involves multiple iterations to refine labels.
- The number of iterations until performance stabilizes is an important metric.

**Interpretation:**
- Many iterations without a stable validation score may indicate that the model is reinforcing its own labeling errors.
- Faster convergence is desirable for efficient model training.
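One simple way to define "until performance stabilizes" is a patience rule on the validation score. The sketch below is one possible rule, not a standard definition: it reports the first iteration at which the score has failed to improve on the best value by more than `tol` for `patience` consecutive rounds. The parameter names and scores are assumptions for illustration.

```python
def iterations_until_convergence(val_scores, tol=0.005, patience=2):
    """Return the 1-based iteration where improvement stalled, or None."""
    best = val_scores[0]
    stale = 0
    for i, score in enumerate(val_scores[1:], start=2):
        if score - best > tol:
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return i
    return None

# Hypothetical validation accuracy after each pseudo-labeling iteration
history = [0.80, 0.85, 0.87, 0.872, 0.871, 0.873]
converged_at = iterations_until_convergence(history)
print(f"converged at iteration {converged_at}")
```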

---

### 9. k-Fold Cross Validation

**Description:**
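- Splits the labeled dataset into \( k \) subsets; each subset serves once as the validation fold while the model is trained, and pseudo-labels are generated, using the remaining folds.
- Helps assess the stability of the model after pseudo-labeling.

**Interpretation:**
- Does not prevent overfitting by itself, but helps detect it: a large gap between training and validation scores across folds is a warning sign.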
- Provides a more reliable estimate of model performance after pseudo-labeling.
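The fold bookkeeping can be sketched without any library. The helper below is a hypothetical, unshuffled splitter written for illustration; in a pseudo-labeling setup, only samples with true labels should be placed in the validation folds, while pseudo-labeled samples are added to the training side.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, val_indices) pairs covering range(n_samples)."""
    # The first n_samples % k folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    indices = list(range(n_samples))
    folds, start = [], 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, val))
        start += size
    return folds

folds = k_fold_indices(n_samples=10, k=3)
for train_idx, val_idx in folds:
    print(f"train={train_idx} val={val_idx}")
```

Each sample appears in exactly one validation fold, so every labeled example contributes once to the performance estimate.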

---


## Pseudo-Labeling (General Overview)

### Pseudo-Labeling Process (Implemented in PyTorch)

| **Parameter**   | **Description**                                                                 |
|-----------------|-------------------------------------------------------------------------------|
| base_model      | The neural network model used for predictions and pseudo-label generation.    |
| threshold       | Minimum confidence required to assign pseudo-labels to unlabeled samples.     |
| max_iter        | Number of iterations for the pseudo-labeling process.                         |
| unlabeled_data  | The dataset that contains the unlabeled samples.                             |
| labeled_data    | The dataset that contains the labeled samples.                               |
| batch_size      | Number of samples per iteration for training.                                |
| learning_rate   | The rate at which the model's weights are updated during training.           |

---

| **Attribute**         | **Description**                                                                 |
|-----------------------|-------------------------------------------------------------------------------|
| pseudo_labels         | The pseudo-labels assigned to the unlabeled samples based on the model's predictions. |
| confidence_threshold  | The threshold probability above which predictions are considered confident. |
| iteration             | The current iteration number in the pseudo-labeling process.                 |

---

| **Method**            | **Description**                                                                 |
|-----------------------|-------------------------------------------------------------------------------|
| fit(X, y)             | Train the model on the labeled data and pseudo-labeled data in an iterative process. |
| predict(X)            | Predict labels for input data `X` using the trained model.                   |
| predict_proba(X)      | Predict class probabilities for input data `X`.                              |

---

[Documentation](https://pytorch.org/tutorials/beginner/supervised_learning/semisupervised_learning_tutorial.html)


# Pseudo-Labeling - Example

## Data loading

```python
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(42)  # make weight initialization and shuffling repeatable

# 1. Data Loading and Processing
# Load the Digits dataset (1,797 8x8 images, 10 classes)
digits = load_digits()
X = digits.data
y = digits.target

# Hold out a test set that is never used for training or pseudo-labeling
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Split the rest into labeled and unlabeled sets (95% unlabeled).
# y_unlabeled is kept only to check pseudo-label quality, never for training.
X_train, X_unlabeled, y_train, y_unlabeled = train_test_split(
    X_pool, y_pool, test_size=0.95, random_state=42, stratify=y_pool
)

# Convert to torch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.long)
X_unlabeled_tensor = torch.tensor(X_unlabeled, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)

# Create DataLoader
labeled_dataset = TensorDataset(X_train_tensor, y_train_tensor)
labeled_loader = DataLoader(labeled_dataset, batch_size=32, shuffle=True)

# 2. Model Definition
class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(X_train.shape[1], 64)
        self.fc2 = nn.Linear(64, 10)  # 10 output classes (digits 0-9)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)  # raw logits; CrossEntropyLoss applies log-softmax

# Initialize model, loss function, and optimizer
model = SimpleNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# 3. Initial Training on Labeled Data
def train(model, loader, epochs):
    model.train()
    for _ in range(epochs):
        for inputs, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), labels)
            loss.backward()
            optimizer.step()

# Train on labeled data (a single pass is not enough for so few samples)
train(model, labeled_loader, epochs=100)

# 4. Generate Pseudo-Labels for Unlabeled Data
def generate_pseudo_labels(model, X_unlabeled_tensor, threshold=0.75):
    model.eval()
    with torch.no_grad():
        probabilities = torch.softmax(model(X_unlabeled_tensor), dim=1)
        confidence, pseudo_labels = probabilities.max(dim=1)

    # Mark predictions below the confidence threshold as rejected (-1)
    pseudo_labels[confidence <= threshold] = -1
    return pseudo_labels

# Get pseudo-labels for unlabeled data
pseudo_labels = generate_pseudo_labels(model, X_unlabeled_tensor)

# 5. Combine Labeled and Pseudo-Labeled Data
# Keep only the confident pseudo-labels (drop the -1 entries)
keep = pseudo_labels != -1
X_combined = torch.cat((X_train_tensor, X_unlabeled_tensor[keep]), dim=0)
y_combined = torch.cat((y_train_tensor, pseudo_labels[keep]), dim=0)
print(f"Accepted {int(keep.sum())} of {len(pseudo_labels)} pseudo-labels")

combined_dataset = TensorDataset(X_combined, y_combined)
combined_loader = DataLoader(combined_dataset, batch_size=32, shuffle=True)

# 6. Retrain Model Using Both Labeled and Pseudo-Labeled Data
train(model, combined_loader, epochs=20)

# 7. Model Evaluation
# Evaluate on the held-out test set, which has true labels and was never
# used for training or pseudo-labeling
model.eval()
with torch.no_grad():
    predicted_labels = model(X_test_tensor).argmax(dim=1)

accuracy = accuracy_score(y_test, predicted_labels.numpy())
print(f"Accuracy on held-out test set: {accuracy:.4f}")
```
