# **S3VM Model Theory**


## S3VM (Semi-Supervised Support Vector Machine)

---

## Theory
S3VM is a semi-supervised learning technique that uses both labeled and unlabeled data to train a Support Vector Machine (SVM). A standard SVM uses only labeled data to find a decision boundary; S3VM also takes the unlabeled points into account. The key idea is to find a decision boundary that separates the labeled data while keeping the unlabeled points outside the margin. To do this, each unlabeled point is assigned the class the current model predicts for it (a pseudo-label), with the most confident predictions trusted first.

The iterative procedure described here is to:
- Train an initial SVM classifier using a small set of labeled data.
- Use the classifier to predict labels for the unlabeled data.
- Treat the most confident predictions from the classifier as additional labeled data.
- Update the classifier by including these new pseudo-labels and repeat the process until convergence.

---

## Mathematical Foundation
- **Model Training**:
  In SVM, the objective is to find a decision function \( f(x) = w \cdot x + b \) whose hyperplane \( f(x) = 0 \) maximizes the margin between classes. For linearly separable labeled data with index set \( L \), this is formulated as:
  $$ \min_{w, b} \frac{1}{2} \| w \|^2 \quad \text{subject to} \quad y_i(w \cdot x_i + b) \geq 1 \quad \forall i \in L $$

- **Unlabeled Data Contribution**:
  For unlabeled data \( U = \{ x_{n+1}, x_{n+2}, ..., x_{n+m} \} \), the classifier is used to predict pseudo-labels \( \hat{y}_i \) based on the decision boundary:
  $$ \hat{y}_i = \text{sign}(w \cdot x_i + b) $$

- **Objective Function**:
  The objective function of S3VM includes both labeled and pseudo-labeled data, penalizing margin violations in each set:
  $$ \min_{w, b} \frac{1}{2} \| w \|^2 + C_L \sum_{i \in L} \mathcal{L}(y_i, f(x_i)) + C_U \sum_{i \in U} \mathcal{L}(\hat{y}_i, f(x_i)) $$

  where:
  - \( C_L \) and \( C_U \) are penalty weights on the losses of the labeled and unlabeled data; a larger value penalizes margin violations in that set more heavily.
  - \( \mathcal{L} \) is a loss function (e.g., hinge loss for classification).
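
To make the objective above concrete, the following minimal sketch evaluates it for a fixed linear model with hinge loss. The function names and the example values of `C_L` and `C_U` are illustrative assumptions, not part of any library.

```python
def hinge(y, score):
    """Hinge loss max(0, 1 - y * score) for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * score)


def decision(w, b, x):
    """Linear decision value f(x) = w . x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def s3vm_objective(w, b, labeled, unlabeled, C_L=1.0, C_U=0.1):
    """0.5 * ||w||^2 + C_L * labeled hinge loss + C_U * pseudo-labeled hinge loss."""
    reg = 0.5 * sum(wj * wj for wj in w)
    loss_l = sum(hinge(y, decision(w, b, x)) for x, y in labeled)
    loss_u = 0.0
    for x in unlabeled:
        score = decision(w, b, x)
        y_hat = 1 if score >= 0 else -1   # pseudo-label = sign(f(x)), ties to +1
        loss_u += hinge(y_hat, score)     # equals max(0, 1 - |f(x)|)
    return reg + C_L * loss_l + C_U * loss_u


# Illustrative data (assumed): two labeled points, two unlabeled points.
labeled = [((2.0, 0.0), 1), ((-2.0, 0.0), -1)]
unlabeled = [(0.5, 1.0), (-3.0, 2.0)]
value = s3vm_objective((1.0, 0.0), 0.0, labeled, unlabeled)
print(value)
```

Because the pseudo-label always agrees with the sign of the score, the unlabeled term only penalizes points that fall inside the margin, which is what pushes the boundary away from dense regions of unlabeled data.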

---

## Algorithm Steps
1. **Initialization**:
   - Start with a small labeled dataset \( L \) and a large unlabeled dataset \( U \).

2. **Initial SVM Training**:
   - Train an SVM model \( f \) using the labeled dataset \( L \).

3. **Predict Pseudo-Labels**:
   - Use the model \( f \) to predict pseudo-labels for the unlabeled dataset \( U \).

4. **Select Confident Predictions**:
   - Choose the most confident predictions based on the margin from the SVM decision boundary:
     $$ \text{Confidence}(x_i) = |w \cdot x_i + b| $$

5. **Dataset Update**:
   - Add the most confident pseudo-labeled examples to the labeled dataset \( L \).

6. **Model Retraining**:
   - Retrain the SVM model using the updated labeled dataset.

7. **Repeat**:
   - Repeat the process of prediction, selection, and retraining until convergence or a stopping criterion is met.
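
The steps above can be sketched as a small self-training loop. This is a minimal illustration under stated assumptions: the linear SVM is a hand-rolled subgradient-descent stand-in (not a library solver), and the names `train_linear_svm`, `self_train`, and the default `threshold` are hypothetical.

```python
import random


def decision(w, b, x):
    """Linear decision value f(x) = w . x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(data, C=1.0, epochs=200, lr=0.01, seed=0):
    """Subgradient descent on 0.5 * ||w||^2 + C * sum of hinge losses."""
    rng = random.Random(seed)
    data = list(data)
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            violated = y * decision(w, b, x) < 1
            for j in range(dim):
                # regulariser gradient is spread evenly over the samples
                grad = w[j] / len(data) - (C * y * x[j] if violated else 0.0)
                w[j] -= lr * grad
            if violated:
                b += lr * C * y
    return w, b


def self_train(labeled, unlabeled, threshold=1.0, max_iter=10):
    """Steps 2-7: train, pseudo-label, keep confident points, retrain."""
    labeled, pool = list(labeled), list(unlabeled)
    w, b = train_linear_svm(labeled)
    for _ in range(max_iter):
        scored = [(x, decision(w, b, x)) for x in pool]
        confident = [(x, s) for x, s in scored if abs(s) >= threshold]
        if not confident:        # stopping criterion: nothing left to add
            break
        labeled += [(x, 1 if s > 0 else -1) for x, s in confident]
        pool = [x for x, s in scored if abs(s) < threshold]
        w, b = train_linear_svm(labeled)
    return w, b, labeled, pool


# Illustrative data (assumed): one labeled point per class, three unlabeled.
labeled = [((2.0, 2.0), 1), ((-2.0, -2.0), -1)]
unlabeled = [(3.0, 3.0), (-3.0, -2.5), (0.2, -0.1)]
w, b, grown, remaining = self_train(labeled, unlabeled)
print(len(grown), remaining)
```

Points close to the boundary stay in the unlabeled pool, so an ambiguous sample is never forced into a class; a production implementation would typically swap the stand-in trainer for a proper SVM solver.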

---

## Key Parameters
- **C_L**: Penalty weight on the loss of the labeled data.
- **C_U**: Penalty weight on the loss of the unlabeled (pseudo-labeled) data.
- **kernel**: The kernel function used for the SVM (e.g., linear, RBF).
- **max_iter**: The maximum number of iterations for the algorithm.
- **threshold**: The confidence threshold for selecting pseudo-labeled data.
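
One way to keep these parameters together is a small validated configuration object. The sketch below is an assumption for illustration: the class name `S3VMConfig` and all default values are placeholders, not recommended settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class S3VMConfig:
    """Hypothetical container for the parameters listed above."""
    C_L: float = 1.0          # penalty weight for labeled data
    C_U: float = 0.1          # penalty weight for unlabeled data
    kernel: str = "linear"    # e.g. "linear" or "rbf"
    max_iter: int = 20        # maximum number of self-training rounds
    threshold: float = 1.0    # minimum |f(x)| for accepting a pseudo-label

    def __post_init__(self):
        if self.C_L <= 0 or self.C_U < 0:
            raise ValueError("C_L must be positive and C_U non-negative")
        if self.max_iter < 1:
            raise ValueError("max_iter must be at least 1")
        if self.threshold < 0:
            raise ValueError("threshold must be non-negative")


config = S3VMConfig(C_U=0.05, threshold=1.5)
print(config)
```

Setting `C_U` below `C_L` reflects the common choice of trusting true labels more than pseudo-labels.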

---

## Advantages
- Can improve the SVM model by using unlabeled data.
- Effective when labeled data is scarce or expensive to obtain.
- Flexible to different kernel functions and model types.
- Can handle large amounts of unlabeled data.

---

## Disadvantages
- Sensitive to the quality of the pseudo-labels.
- May propagate errors when the model is confidently wrong: incorrect pseudo-labels added early are reinforced in later iterations.
- Requires careful tuning of regularization parameters \( C_L \) and \( C_U \).
- Computationally expensive due to retraining after each update.

---

## Implementation Tips
- Use a **well-calibrated SVM** to ensure reliable confidence scores for pseudo-labels.
- Start with a **diverse labeled dataset** to avoid bias in initial model predictions.
- Consider using a **stricter (higher) confidence threshold** initially to avoid incorrect pseudo-labels.
- Use **cross-validation** to monitor the performance of the model during updates.

---

## Applications
- Text classification (e.g., sentiment analysis, topic classification).
- Image classification (e.g., object detection, facial recognition).
- Bioinformatics (e.g., protein classification, gene expression).
- Speech recognition (e.g., speaker identification).
- Anomaly detection in various domains (e.g., fraud detection, system monitoring).

S3VM is a powerful technique for semi-supervised learning, particularly in scenarios where labeled data is limited but large amounts of unlabeled data are available. While it requires careful tuning and evaluation, it can significantly improve the model's performance in many real-world applications.


## **Model Evaluation for S3VM**

### 1. Accuracy

**Formula:**
$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$

**Description:**
- Measures the overall correctness of the S3VM model's predictions.
- Compares the number of correct predictions to the total predictions.

**Interpretation:**
- Higher accuracy indicates better performance.
- Can be misleading if the dataset is imbalanced.
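
As a small illustration of the imbalance caveat, the sketch below computes accuracy from confusion counts for an assumed dataset with 95 negatives and 5 positives, where the model predicts negative for every sample.

```python
def accuracy(tp, tn, fp, fn):
    """(TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to evaluate")
    return (tp + tn) / total


# A model that never predicts the positive class still scores highly.
always_negative = accuracy(tp=0, tn=95, fp=0, fn=5)
print(always_negative)  # 0.95
```

The score looks strong even though no positive sample was found, which is why accuracy is best read alongside precision and recall.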

---

### 2. Precision

**Formula:**
$$
\text{Precision} = \frac{TP}{TP + FP}
$$

**Description:**
- Measures the proportion of correctly predicted positive samples out of all predicted positives.
- Helps assess the reliability of the S3VM model’s positive predictions.

**Interpretation:**
- High precision means fewer false positives.
- Important when false positives are costly.
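
A minimal sketch of the formula, with the counts chosen purely for illustration. Returning 0.0 when nothing is predicted positive is a convention assumed here; some tools raise a warning instead.

```python
def precision(tp, fp):
    """TP / (TP + FP); returns 0.0 when no positives were predicted."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0


# 8 of the 10 samples predicted positive are truly positive.
p = precision(tp=8, fp=2)
print(p)  # 0.8
```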

---

### 3. Recall (Sensitivity)

**Formula:**
$$
\text{Recall} = \frac{TP}{TP + FN}
$$

**Description:**
- Measures how many actual positives were correctly identified by the S3VM.
- Crucial when detecting all possible positives is important.

**Interpretation:**
- High recall means fewer false negatives.
- Important for detecting rare events or critical positive samples.
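
Since an SVM classifies by comparing the decision value to a cutoff (normally 0), recall can be traded against precision by moving that cutoff. The sketch below uses assumed decision values and labels to show the effect; `recall_at` is a hypothetical helper.

```python
def recall_at(scores, labels, cutoff=0.0):
    """Recall when samples with score >= cutoff are predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < cutoff)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Illustrative decision values f(x) and true labels (assumed data).
scores = [1.8, 0.6, -0.2, -0.4, -1.5]
labels = [1, 1, 1, -1, -1]

default_recall = recall_at(scores, labels, cutoff=0.0)    # misses the -0.2 positive
relaxed_recall = recall_at(scores, labels, cutoff=-0.3)   # now catches it
print(default_recall, relaxed_recall)
```

Lowering the cutoff usually raises recall at the cost of more false positives, so the choice depends on which error is more expensive.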

---

### 4. F1-Score

**Formula:**
$$
F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$

**Description:**
- The harmonic mean of precision and recall.
- Useful for evaluating S3VM performance in cases of imbalanced datasets.

**Interpretation:**
- Higher F1-score indicates a good balance between precision and recall.
- Valuable when both false positives and false negatives are costly.
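
The sketch below, with assumed precision and recall values, shows how the harmonic mean penalizes an imbalance between the two that an arithmetic mean would hide.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# High precision but very low recall: the arithmetic mean would be 0.5.
f1 = f1_score(0.9, 0.1)
print(round(f1, 2))  # 0.18
```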

---

### 5. Confusion Matrix

**Description:**
- A table summarizing true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
- Helps in understanding how well the S3VM model distinguishes between classes.

**Interpretation:**
- Visualizes classification errors.
- Important for analyzing model mistakes, especially for complex data.
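
The four cells can be tallied directly from true and predicted labels. This is a minimal sketch with assumed labels in {-1, +1}; `confusion_counts` is a hypothetical helper name.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN and FN for a binary problem."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["TP" if truth == positive else "FP"] += 1
        else:
            counts["FN" if truth == positive else "TN"] += 1
    return {key: counts[key] for key in ("TP", "FP", "TN", "FN")}


# Illustrative labels (assumed data).
y_true = [1, 1, 1, -1, -1, -1, -1]
y_pred = [1, 1, -1, -1, -1, 1, -1]
cm = confusion_counts(y_true, y_pred)
print(cm)  # {'TP': 2, 'FP': 1, 'TN': 3, 'FN': 1}
```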

---

### 6. AUC-ROC Curve

**Description:**
- Plots True Positive Rate (TPR) vs. False Positive Rate (FPR).
- AUC (Area Under Curve) measures the overall performance of the S3VM model.

**Interpretation:**
- **AUC = 1** → Perfect model.
- **AUC > 0.8** → Often regarded as a strong model (a rule of thumb; the acceptable level depends on the task).
- **AUC = 0.5** → Random guessing.
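
AUC can be computed without plotting the curve: it equals the fraction of positive-negative pairs in which the positive sample receives the higher decision value, with ties counted as half. The sketch below uses this pairwise form on assumed scores; it is quadratic in the number of samples and meant only for illustration.

```python
def roc_auc(scores, labels, positive=1):
    """AUC as the probability that a positive outranks a negative."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Illustrative SVM decision values and true labels (assumed data).
scores = [2.1, 0.7, -0.3, 0.2, -1.4]
labels = [1, 1, 1, -1, -1]
auc = roc_auc(scores, labels)
print(round(auc, 3))  # 0.833
```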

---

### 7. Margin Distribution

**Description:**
- S3VM focuses on maximizing the margin between classes.
- This diagnostic examines how the decision values \( f(x_i) \) are distributed relative to the margin, for example how many labeled or unlabeled points fall inside \( |f(x_i)| < 1 \).

**Interpretation:**
- A larger margin typically leads to better generalization.
- Helps assess the model’s robustness to noise and variance.
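
One simple way to summarize the margin distribution is sketched below. For labeled points the margin is taken as \( y_i f(x_i) \) and for unlabeled points as \( |f(x_i)| \); the helper name `margin_summary` and the chosen statistics are assumptions for illustration.

```python
def margin_summary(scores, labels=None):
    """Summarize margins: y * f(x) if labels are given, else |f(x)|."""
    if labels is None:
        margins = [abs(s) for s in scores]
    else:
        margins = [y * s for s, y in zip(scores, labels)]
    inside = sum(1 for m in margins if m < 1.0)
    return {
        "min": min(margins),
        "mean": sum(margins) / len(margins),
        "frac_inside": inside / len(margins),   # share of points with margin < 1
    }


# Illustrative decision values for four unlabeled points (assumed data).
summary = margin_summary([1.6, -2.0, 0.3, -0.1])
print(summary)
```

A high `frac_inside` on the unlabeled set suggests the boundary still cuts through a dense region of data.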

---

### 8. Number of Support Vectors

**Description:**
- The number of support vectors used in the S3VM model.
- A key factor in understanding model complexity.

**Interpretation:**
- Too many support vectors may indicate overfitting.
- Fewer support vectors suggest a simpler model, but could also lead to underfitting.
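
For a trained linear soft-margin SVM, the support vectors are, up to degenerate cases, the training points that lie on or inside the margin, so they can be approximated by counting points with \( y_i f(x_i) \le 1 \). The sketch below does this for an assumed model and dataset; `count_support_vectors` is a hypothetical helper and the tolerance is an arbitrary choice.

```python
def count_support_vectors(w, b, data, tol=1e-9):
    """Count points with y * (w . x + b) <= 1, i.e. on or inside the margin."""
    count = 0
    for x, y in data:
        margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
        if margin <= 1.0 + tol:
            count += 1
    return count


# Illustrative model and data (assumed).
w, b = (0.5, 0.0), 0.0
data = [
    ((2.0, 0.0), 1),    # exactly on the margin
    ((4.0, 1.0), 1),    # well outside the margin
    ((-2.0, 0.0), -1),  # exactly on the margin
    ((-1.0, 3.0), -1),  # inside the margin
]
n_sv = count_support_vectors(w, b, data)
print(n_sv, n_sv / len(data))  # 3 0.75
```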

---

### 9. Convergence Rate

**Description:**
- S3VM involves iterative optimization, and this metric tracks how quickly the model converges to the optimal decision boundary.

**Interpretation:**
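- Fast convergence is desirable, but convergence after very few iterations may mean that few pseudo-labels changed and the unlabeled data had little effect.
- Slow convergence may indicate that pseudo-labels keep changing between iterations, for example because many unlabeled points lie close to the decision boundary.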
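
A simple way to track convergence is the fraction of pseudo-labels that change from one iteration to the next. The sketch below assumes the pseudo-labels of each iteration were recorded in a list; the helper names and the tolerance are illustrative.

```python
def label_change_rate(previous, current):
    """Fraction of pseudo-labels that differ between two iterations."""
    if len(previous) != len(current):
        raise ValueError("label lists must have the same length")
    changed = sum(1 for a, b in zip(previous, current) if a != b)
    return changed / len(current)


def converged_at(history, tol=0.0):
    """Index of the first iteration whose change rate is <= tol, else None."""
    for i in range(1, len(history)):
        if label_change_rate(history[i - 1], history[i]) <= tol:
            return i
    return None


# Pseudo-labels of five unlabeled points over four iterations (assumed data).
history = [
    [1, 1, -1, -1, 1],
    [1, -1, -1, -1, 1],
    [1, -1, -1, -1, -1],
    [1, -1, -1, -1, -1],
]
rates = [label_change_rate(a, b) for a, b in zip(history, history[1:])]
print(rates, converged_at(history))  # [0.2, 0.2, 0.0] 3
```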

---

### 10. Cross-Validation

**Description:**
- Cross-validation helps assess the generalization ability of the S3VM model by splitting the dataset into multiple training and validation sets.

**Interpretation:**
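- Helps detect overfitting by evaluating on data held out from training; validation folds should contain only truly labeled samples, not pseudo-labeled ones.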
- Provides a more robust estimate of the model’s performance.
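
In a semi-supervised setting, one reasonable arrangement is to split only the labeled samples into folds and to reuse the full unlabeled set in every training fold. The sketch below generates such index splits; `k_fold_indices` is a hypothetical helper, and it does not shuffle, which is usually done beforehand.

```python
def k_fold_indices(n_labeled, k):
    """Yield (train_idx, val_idx) pairs over the labeled samples only."""
    if not 2 <= k <= n_labeled:
        raise ValueError("k must be between 2 and the number of labeled samples")
    folds = [list(range(i, n_labeled, k)) for i in range(k)]
    for i, val_idx in enumerate(folds):
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, val_idx


# Six labeled samples, three folds; unlabeled data would join every train split.
splits = list(k_fold_indices(6, 3))
for train_idx, val_idx in splits:
    print(train_idx, val_idx)
```

Each labeled sample appears in exactly one validation fold, so every performance estimate is based on true labels the model did not see during that round.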


## S3VM (Semi-Supervised Support Vector Machine)

### class libsvm.svm.S3VM

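S3VM is a semi-supervised learning algorithm that combines labeled and unlabeled data for training. It is used for classification tasks, where it iteratively assigns pseudo-labels to unlabeled data. The standard LIBSVM distribution documents supervised and one-class formulations (C-SVC, nu-SVC, one-class SVM, epsilon-SVR, nu-SVR) and does not appear to ship an S3VM class, so the interface below is best read as a reference design built on an SVM solver rather than a shipped API.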

| **Parameter**   | **Description**                                                                 |
|-----------------|-------------------------------------------------------------------------------|
| C               | Regularization parameter that controls the trade-off between maximizing the margin and minimizing classification errors. |
| kernel         | The kernel function used to transform the data, typically `'linear'`, `'rbf'`, etc. |
| max_iter       | Maximum number of iterations for training the S3VM model.                    |
| tolerance      | Tolerance for stopping criterion, when to stop the iterative process.        |
| unlabeled_data | The set of unlabeled data that is used to refine the model's decision boundary. |
| nu              | A parameter that controls the fraction of unlabeled data that is allowed to have an incorrect pseudo-label. |

---

| **Attribute**         | **Description**                                                                 |
|-----------------------|-------------------------------------------------------------------------------|
| support_vectors_      | The support vectors identified by the SVM during training.                    |
| dual_coef_            | Dual coefficients associated with the support vectors in the SVM optimization problem. |
| labels_               | The final labels assigned to both labeled and pseudo-labeled (unlabeled) samples after training. |

---

| **Method**            | **Description**                                                                 |
|-----------------------|-------------------------------------------------------------------------------|
| fit(X_labeled, y_labeled, X_unlabeled) | Train the S3VM model using both labeled and unlabeled data. The algorithm iteratively assigns pseudo-labels to the unlabeled data based on the current model. |
| predict(X)            | Predict labels for input data `X` using the trained S3VM model.              |
| decision_function(X)  | Compute the decision function for input `X` (used to classify the data).     |

---

### Documentation
[S3VM Documentation - LIBSVM](https://www.csie.ntu.edu.tw/~cjlin/libsvm/)


# S3VM classification - Example

## Data loading

## Data processing

## Plotting data

## Model definition

## Model evaluation