# **Self-Training Model Theory**


## Theory
Self-Training is a semi-supervised learning algorithm where a classifier is iteratively trained on its own most confident predictions. It uses a small labeled dataset to train an initial model, then labels the unlabeled data and incorporates the most confident predictions back into the training set.

The main idea is to:
- Train a classifier with the labeled data.
- Predict labels for the unlabeled data.
- Use the most confident predictions to augment the labeled dataset.
- Repeat the process to refine the classifier.

## Self-Training Process
1. **Initialization**:
   - Start with a small labeled dataset and a large unlabeled dataset.
   - Train an initial classifier using the labeled data.

2. **Iteration**:
   - Predict labels for the unlabeled data.
   - Select confident predictions to add to the labeled dataset.
   - Retrain the classifier with the expanded labeled dataset.
   - Repeat until no unlabeled samples remain, no prediction passes the confidence criterion, or an iteration limit is reached.
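
The loop described above can be sketched in a few lines. This is a minimal illustration, not the scikit-learn implementation: the helper name `self_train`, the choice of `LogisticRegression` as the classifier, the synthetic data, and the threshold of 0.9 are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def self_train(X, y, threshold=0.9, max_iter=10):
    """Hypothetical helper: pseudo-label rows marked -1 whose confidence exceeds `threshold`."""
    y = y.copy()
    model = LogisticRegression(max_iter=1000)
    model.fit(X[y != -1], y[y != -1])              # initial training on labeled rows only
    for _ in range(max_iter):
        unlabeled = np.flatnonzero(y == -1)
        if unlabeled.size == 0:
            break                                   # every sample is labeled
        proba = model.predict_proba(X[unlabeled])
        confident = proba.max(axis=1) > threshold
        if not confident.any():
            break                                   # nothing passes the threshold
        # Adopt the most probable class as the pseudo-label for confident rows
        y[unlabeled[confident]] = model.classes_[proba[confident].argmax(axis=1)]
        model.fit(X[y != -1], y[y != -1])           # retrain on the expanded set
    return model, y


# Synthetic data with about 70% of the labels hidden (marked -1)
X, y_true = make_classification(n_samples=300, n_features=10, random_state=0)
y = y_true.copy()
rng = np.random.RandomState(0)
y[rng.rand(len(y)) < 0.7] = -1

model, y_final = self_train(X, y)
print("labeled before:", (y != -1).sum(), "labeled after:", (y_final != -1).sum())
```

Samples that never pass the threshold keep the `-1` marker, so the returned label vector also shows how much of the unlabeled pool was actually used.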

## Key Steps
1. **Initial Training**:
   - Train a classifier using the labeled data.

2. **Label Prediction**:
   - Use the trained classifier to predict labels for the unlabeled data.

3. **Confident Selection**:
   - Identify the most confident predictions based on a confidence threshold.

4. **Dataset Augmentation**:
   - Add the confident predictions, with their predicted labels, to the labeled dataset.

5. **Retraining**:
   - Retrain the classifier with the updated labeled dataset.
   - Repeat from step 2 until no new samples are added or an iteration limit is reached.

## Mathematical Formulation
1. **Confidence Measure**:
   - The confidence of a prediction is the probability the classifier assigns to its predicted class $\hat{y}$:
$$ \text{Confidence}(x) = \max_{y} \, p(y \mid x), \qquad \hat{y} = \arg\max_{y} \, p(y \mid x) $$

2. **Threshold-Based Selection**:
   - Select the pseudo-labeled pairs whose confidence exceeds a predefined threshold:
$$ \{(x, \hat{y}) \mid \text{Confidence}(x) > \text{threshold} \} $$
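
As a small worked example of the two formulas, the snippet below applies them to a hand-written probability matrix. The matrix values and the threshold of 0.75 are made up for illustration.

```python
import numpy as np

# Each row holds p(y | x) for one unlabeled sample over three classes
proba = np.array([
    [0.95, 0.03, 0.02],
    [0.40, 0.35, 0.25],
    [0.10, 0.80, 0.10],
    [0.05, 0.15, 0.80],
])
threshold = 0.75

confidence = proba.max(axis=1)      # Confidence(x) = max_y p(y | x)
y_hat = proba.argmax(axis=1)        # predicted class = argmax_y p(y | x)
selected = confidence > threshold   # threshold-based selection

print(np.flatnonzero(selected))     # [0 2 3]
print(y_hat[selected])              # [0 1 2]
```

The second sample is left unlabeled because its best class reaches only 0.40.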

## Advantages
- Leverages both labeled and unlabeled data.
- Can improve performance with limited labeled data.
- Simple and easy to implement.

## Applications
- Text classification.
- Image recognition.
- Any domain with a large amount of unlabeled data.



## Model Evaluation for Self-Training Classifier

### 1. Accuracy Score
Formula:
$$
\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}
$$
Description:
- Accuracy measures the ratio of correct predictions to total predictions.
- Commonly used as a primary metric for balanced datasets.
Interpretation:
- Higher accuracy indicates better overall performance.
- Limitations:
  - May not be suitable for imbalanced datasets.
  - Should be used alongside other metrics for comprehensive evaluation.
---

### 2. Gini Impurity
Formula:
$$
\text{Gini} = 1 - \sum_{i=1}^{c} (p_i)^2
$$
Description:
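- Gini Impurity measures the probability that a randomly chosen element is misclassified when it is labeled at random according to the class distribution of the node.
- Used as a splitting criterion during tree construction, so it applies only when the base estimator is a decision tree or tree ensemble; it does not evaluate the self-training wrapper itself.
Interpretation:
- Ranges from 0 (pure node) to 1 - 1/c for c equally likely classes, which is 0.5 for binary classification.
- Lower values indicate better class separation.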
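
The formula can be checked on a few small label lists. This is a plain-Python sketch; the function name `gini_impurity` is chosen for the example.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini = 1 - sum_i p_i^2, with p_i the fraction of labels in class i."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


print(gini_impurity(["a", "a", "a", "a"]))  # 0.0 (pure node)
print(gini_impurity(["a", "a", "b", "b"]))  # 0.5 (binary maximum)
print(round(gini_impurity(["a", "b", "c"]), 3))  # 0.667 (three balanced classes)
```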
---

### 3. Information Gain
Formula:
$$
\text{IG}(T,a) = H(T) - \sum_{v \in \text{values}(a)} \frac{|T_v|}{|T|} H(T_v)
$$
Description:
- Information Gain measures the reduction in entropy after splitting a set T on an attribute a, where H is the Shannon entropy and T_v is the subset of T in which a takes the value v.
- Alternative splitting criterion to Gini impurity for tree-based base estimators.
Interpretation:
- Higher values indicate more informative splits.
- Used to select the best features for splitting nodes.
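
A short plain-Python sketch of the formula, assuming base-2 entropy. The function names and the toy labels are illustrative only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(T) in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, attribute_values):
    """IG(T, a) = H(T) - sum_v |T_v| / |T| * H(T_v)."""
    n = len(labels)
    groups = {}
    for label, value in zip(labels, attribute_values):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


labels = ["yes", "yes", "no", "no"]
print(information_gain(labels, ["a", "a", "b", "b"]))  # 1.0: the split separates the classes
print(information_gain(labels, ["a", "b", "a", "b"]))  # 0.0: the split tells us nothing
```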
---

### 4. Model Complexity Metrics
Description:
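- Number of Iterations: The number of self-training rounds run before the algorithm stops (exposed as `n_iter_` in scikit-learn).
- Number of Label Changes: The total number of samples whose label changes from unlabeled to a pseudo-label during the training process.
Interpretation:
- Few rounds with many confidently labeled samples suggest that the initial classifier already separates the classes well; many pseudo-labels added at low confidence raise the risk of reinforcing early mistakes.
- Used for tuning hyperparameters such as the confidence threshold and the iteration limit.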
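
In scikit-learn these quantities can be read from the fitted estimator. The sketch below assumes synthetic data and a `LogisticRegression` base estimator, both chosen only for illustration; `n_iter_`, `labeled_iter_`, and `termination_condition_` are attributes of `SelfTrainingClassifier`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic data with about 70% of the labels hidden (marked -1)
X, y_true = make_classification(n_samples=400, n_features=10, random_state=0)
y = y_true.copy()
rng = np.random.RandomState(0)
y[rng.rand(len(y)) < 0.7] = -1

clf = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9, max_iter=10)
clf.fit(X, y)

print("iterations run:", clf.n_iter_)
print("stopped because:", clf.termination_condition_)

# labeled_iter_ is 0 for originally labeled samples, k for samples
# pseudo-labeled in round k, and -1 for samples that were never labeled
rounds, counts = np.unique(clf.labeled_iter_, return_counts=True)
print(dict(zip(rounds.tolist(), counts.tolist())))
```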
---

### 5. Precision
Formula:
$$
\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}
$$
Description:
- Precision shows the accuracy of positive predictions.
- Important when false positives are costly.
Interpretation:
- Higher precision means fewer false positive predictions.
- Use case: Particularly important in medical diagnosis and spam detection.
---

### 6. Recall (Sensitivity)
Formula:
$$
\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}
$$
Description:
- Recall indicates the model's ability to identify all relevant cases.
- Critical in scenarios where missing positive cases is costly.
Interpretation:
- Higher recall means fewer false negatives.
- Use case: Essential in medical screening and fraud detection.
---

### 7. Feature Importance
Formula:
$$
\text{Importance}(x_i) = \sum_{t \in \text{splits on }x_i} n_t \cdot \Delta\text{impurity}
$$
Description:
- Measures the contribution of each feature to the model's decisions.
- Based on the total reduction in impurity from splits on each feature, weighted by the number of samples n_t reaching each split; available only when the base estimator is tree-based.
Interpretation:
- Higher values indicate more influential features.
- Useful for feature selection and model understanding.
---

### 8. Cross-Validation Scores
Description:
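- K-fold cross-validation provides robust performance estimates.
- Includes metrics for each fold and their statistical distribution.
- In the semi-supervised setting, each validation fold should contain only samples with known true labels; unlabeled samples (marked `-1`) belong in the training folds.
Interpretation:
- Low variance across folds indicates stable model performance.
- High variance may suggest overfitting or data inconsistencies.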
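
One way to cross-validate a self-training model is to split only the labeled samples into folds and append the whole unlabeled pool to every training fold. The sketch below follows that scheme under stated assumptions: synthetic data, a `LogisticRegression` base estimator, and 5 stratified folds are all illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.semi_supervised import SelfTrainingClassifier

X, y_true = make_classification(n_samples=400, n_features=10, random_state=0)
rng = np.random.RandomState(0)
unlabeled = rng.rand(len(y_true)) < 0.7          # about 70% treated as unlabeled

X_lab, y_lab = X[~unlabeled], y_true[~unlabeled]
X_unlab = X[unlabeled]

scores = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X_lab, y_lab):
    # Training fold: labeled training rows plus every unlabeled row (marked -1)
    X_train = np.vstack([X_lab[train_idx], X_unlab])
    y_train = np.concatenate([y_lab[train_idx], np.full(len(X_unlab), -1)])
    clf = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)
    # Validation fold: held-out rows with known true labels only
    scores.append(clf.score(X_lab[test_idx], y_lab[test_idx]))

print("fold accuracies:", np.round(scores, 3))
print("mean:", np.mean(scores), "std:", np.std(scores))
```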
---

### 9. Confusion Matrix
Description:
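- Provides a detailed breakdown of prediction outcomes:
  - True Positives (TP)
  - True Negatives (TN)
  - False Positives (FP)
  - False Negatives (FN)
- For multi-class problems, the matrix has one row and one column per class, and TP, TN, FP, and FN are counted per class by treating that class as positive and all other classes as negative.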
Interpretation:
- Helps identify specific types of errors.
- Essential for understanding class-wise performance.
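
The four counts are enough to reproduce the accuracy, precision, and recall formulas from the earlier sections. A plain-Python sketch on made-up binary labels, with a helper name chosen for the example:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tp, tn, fp, fn)                # 3 4 2 1
print(accuracy, precision, recall)   # 0.7 0.6 0.75
```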
---


## sklearn template [SelfTrainingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.semi_supervised.SelfTrainingClassifier.html)

### class `sklearn.semi_supervised.SelfTrainingClassifier(base_estimator, threshold=0.75, criterion='threshold', k_best=10, max_iter=10, verbose=False)`

| **Parameter**    | **Description**                                                                                                                          | **Default**   |
|------------------|------------------------------------------------------------------------------------------------------------------------------------------|---------------|
| `base_estimator` | The base estimator to be used for self-training; it must implement `fit` and `predict_proba`. Renamed to `estimator` in scikit-learn 1.6. | required      |
| `threshold`      | Confidence threshold a prediction must exceed to be added as a pseudo-label. Used when `criterion='threshold'`.                           | `0.75`        |
| `criterion`      | The criterion used to select which pseudo-labels to add in each iteration. Options: `'threshold'`, `'k_best'`.                            | `'threshold'` |
| `k_best`         | The number of most confident samples to pseudo-label at each iteration if `criterion='k_best'`.                                          | `10`          |
| `max_iter`       | The maximum number of self-training iterations. If `None`, iterate until all samples are labeled or no new pseudo-labels are added.       | `10`          |
| `verbose`        | Whether to output verbose information.                                                                                                   | `False`       |


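| **Attribute**            | **Description**                                                                                                   |
|--------------------------|-------------------------------------------------------------------------------------------------------------------|
| `base_estimator_`        | The fitted clone of the base estimator.                                                                           |
| `classes_`               | The class labels seen during fitting.                                                                             |
| `transduction_`          | The labels used for the final fit of the classifier, including the pseudo-labels added during fitting.            |
| `labeled_iter_`          | The iteration in which each sample was labeled: `0` for originally labeled samples, `-1` if never labeled.        |
| `n_iter_`                | The number of self-training iterations run.                                                                       |
| `termination_condition_` | The reason fitting stopped: `'max_iter'`, `'no_change'`, or `'all_labeled'`.                                      |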


| **Method**             | **Description**                                                                                      |
|------------------------|------------------------------------------------------------------------------------------------------|
| `fit(X, y)`            | Fit the self-training classifier from the training set; unlabeled samples in `y` must be marked `-1`. |
| `predict(X)`           | Predict class for X.                                                                                 |
| `predict_proba(X)`     | Predict class probabilities of the input samples X.                                                  |
| `score(X, y)`          | Returns the mean accuracy on the given test data and labels.                                         |
| `get_params()`         | Get parameters for this estimator.                                                                   |
| `set_params(**params)` | Set the parameters of this estimator.                                                                |


# Self-Training Classifier - Example

## Data loading

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Step 1: Data Import
digits = datasets.load_digits()
X = digits.data
y_true = digits.target.copy()  # untouched ground-truth labels, kept for evaluation

# Step 2: Data Processing
# Standardizing the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Create a partially labeled dataset
rng = np.random.RandomState(42)
random_unlabeled_points = rng.rand(len(y_true)) < 0.7
y = y_true.copy()  # mask a copy; writing -1 into digits.target would corrupt y_true
y[random_unlabeled_points] = -1  # Mark about 70% of the points as unlabeled (-1)

# Step 3: Model Definition and Training
# Using SVM as the base estimator (probability=True enables predict_proba)
base_estimator = SVC(probability=True, gamma='scale')

# Self-training classifier
self_training_model = SelfTrainingClassifier(base_estimator)
self_training_model.fit(X_scaled, y)

# Step 4: Model Evaluation
# Predictions are compared against the true labels of all samples,
# including the ones that were hidden from the model during training
y_pred = self_training_model.predict(X_scaled)

# Evaluating the model
accuracy = accuracy_score(y_true, y_pred)
report = classification_report(y_true, y_pred)

print(f"Accuracy: {accuracy * 100:.2f}%")
print("Classification Report:")
print(report)
```

Recorded output from a run in which `digits.target` was masked in place (`y = digits.target` without a copy). In that case `y_true` also contains the `-1` placeholders, so the report gains a spurious `-1` class and accuracy cannot exceed the labeled fraction (531 of 1797 samples, about 29.5%):

```text
Accuracy: 29.44%
Classification Report:
              precision    recall  f1-score   support

          -1       0.00      0.00      0.00      1266
           0       0.24      1.00      0.38        42
           1       0.25      1.00      0.40        46
           2       0.33      1.00      0.50        58
           3       0.28      1.00      0.43        49
           4       0.29      1.00      0.45        53
           5       0.34      1.00      0.50        62
           6       0.27      1.00      0.43        49
           7       0.35      1.00      0.52        61
           8       0.30      0.98      0.46        54
           9       0.30      0.98      0.46        57

    accuracy                           0.29      1797
   macro avg       0.27      0.91      0.41      1797
weighted avg       0.09      0.29      0.14      1797
```
