# **LabelPropagation Model Theory**


## Theory
Label Propagation is a semi-supervised learning algorithm that spreads labels from labeled data points to unlabeled data points through a graph structure. The main idea is to leverage the connections between data points to propagate labels across the graph, resulting in label assignment for previously unlabeled data.

## Label Propagation Process
1. **Graph Construction**:
- Represent the dataset as a graph where nodes correspond to data points and edges represent the similarity or connection between points.
- The edge weights reflect the degree of similarity.

2. **Initialization**:
- Assign labels to the labeled data points.
- Initialize unlabeled data points with a default label or leave them unlabeled.

3. **Propagation**:
- Iteratively update the label of each data point based on the labels of its neighbors.
- This process continues until labels converge or a stopping criterion is met.

## Key Steps
1. **Graph Construction**:
- Compute a similarity matrix to represent the connections between data points.
- Create a graph using the similarity matrix as edge weights.

2. **Label Initialization**:
- Assign initial labels to labeled data points.
- Initialize the label distribution for each unlabeled data point.

3. **Label Update**:
- For each unlabeled data point, update its label by aggregating the labels of its neighbors, weighted by the edge strengths.

4. **Convergence**:
- Repeat the label update step until the labels stabilize or the maximum number of iterations is reached.

## Mathematical Formulation
1. **Similarity Matrix**:
$$ W_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right) $$

2. **Label Propagation Rule**:
$$ Y_{t+1} = \alpha WY_t + (1 - \alpha)Y_0 $$
where:
- \( Y_t \) is the label matrix at iteration \( t \).
- \( \alpha \) is the damping factor.
- \( W \) is the normalized similarity matrix.
- \( Y_0 \) is the initial label matrix.

## Advantages
- Efficient for large datasets.
- Leverages both labeled and unlabeled data.
- Suitable for problems with limited labeled data.

## Applications
- Image segmentation.
- Social network analysis.
- Text classification.



## Model Evaluation for LabelPropagation Classifier

### 1. Accuracy Score
Formula:
$$
\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}
$$
Description:
- Accuracy measures the ratio of correct predictions to total predictions.
- Commonly used as a primary metric for balanced datasets.
Interpretation:
- Higher accuracy indicates better overall performance.
- Limitations:
  - May not be suitable for imbalanced datasets.
  - Should be used alongside other metrics for comprehensive evaluation.
---

### 2. Gini Impurity
Formula:
$$
\text{Gini} = 1 - \sum_{i=1}^{c} (p_i)^2
$$
Description:
- Gini Impurity measures the probability of incorrect classification of a randomly chosen element.
- Used as a splitting criterion during tree construction.
Interpretation:
- Ranges from 0 (pure node) to 0.5 (maximum impurity for binary classification).
- Lower values indicate better class separation.
---

### 3. Information Gain
Formula:
$$
\text{IG}(T,a) = H(T) - \sum_{v \in \text{values}(a)} \frac{|T_v|}{|T|} H(T_v)
$$
Description:
- Information Gain measures the reduction in entropy after splitting on an attribute.
- Alternative splitting criterion to Gini impurity.
Interpretation:
- Higher values indicate more informative splits.
- Used to select the best features for splitting nodes.
---

### 4. Model Complexity Metrics
Description:
- Number of Iterations: The number of iterations required for the algorithm to converge.
- Number of Label Changes: The total number of label changes during the propagation process.
Interpretation:
- Lower complexity often indicates better generalization.
- Used for tuning hyperparameters and improving model efficiency.
---

### 5. Precision
Formula:
$$
\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}
$$
Description:
- Precision shows the accuracy of positive predictions.
- Important when false positives are costly.
Interpretation:
- Higher precision means fewer false positive predictions.
- Use case: Particularly important in medical diagnosis and spam detection.
---

### 6. Recall (Sensitivity)
Formula:
$$
\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}
$$
Description:
- Recall indicates the model's ability to identify all relevant cases.
- Critical in scenarios where missing positive cases is costly.
Interpretation:
- Higher recall means fewer false negatives.
- Use case: Essential in medical screening and fraud detection.
---

### 7. Feature Importance
Formula:
$$
\text{Importance}(x_i) = \sum_{t \in \text{splits on }x_i} n_t \cdot \Delta\text{impurity}
$$
Description:
- Measures the contribution of each feature to the model's decisions.
- Based on the total reduction in impurity from splits on each feature.
Interpretation:
- Higher values indicate more influential features.
- Useful for feature selection and model understanding.
---

### 8. Cross-Validation Scores
Description:
- K-fold cross-validation provides robust performance estimates.
- Includes metrics for each fold and their statistical distribution.
Interpretation:
- Low variance across folds indicates stable model performance.
- High variance may suggest overfitting or data inconsistencies.
---

### 9. Confusion Matrix
Description:
- Provides detailed breakdown of prediction outcomes:
  - True Positives (TP)
  - True Negatives (TN)
  - False Positives (FP)
  - False Negatives (FN)
Interpretation:
- Helps identify specific types of errors.
- Essential for understanding class-wise performance.
---


## sklearn template [LabelPropagation](https://scikit-learn.org/stable/modules/generated/sklearn.semi_supervised.LabelPropagation.html)

### class sklearn.semi_supervised.LabelPropagation(*, kernel='rbf', gamma=20, n_neighbors=None, max_iter=1000, tol=0.001, n_jobs=None)

| **Parameter**               | **Description**                                                                                                                                        | **Default**      |
|----------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|------------------|
| `kernel`                   | String, kernel function used ('knn', 'rbf').                                                                                                          | `rbf`            |
| `gamma`                    | Float, parameter for rbf kernel function.                                                                                                             | `20`             |
| `n_neighbors`              | Integer, parameter for knn kernel function.                                                                                                           | `None`           |
| `max_iter`                 | Integer, maximum number of iterations allowed.                                                                                                        | `1000`           |
| `tol`                      | Float, tolerance stopping criterion.                                                                                                                  | `0.001`          |
| `n_jobs`                   | Integer, number of parallel jobs to run.                                                                                                              | `None`           |

-

| **Attribute**              | **Description**                                                                                                                                        |
|----------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|
| `X_`                       | The input data used for fitting the model.                                                                                                            |
| `classes_`                 | The classes labels.                                                                                                                                   |
| `label_distributions_`     | Label distributions for each sample.                                                                                                                  |
| `transduction_`            | The predicted labels for the input data.                                                                                                              |
| `n_iter_`                  | Number of iterations run.                                                                                                                             |

-

| **Method**                 | **Description**                                                                                                                                        |
|----------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|
| `fit(X, y)`                | Fit a semi-supervised label propagation model based on input data.                                                                                    |
| `predict(X)`               | Perform classification on input samples.                                                                                                              |
| `predict_proba(X)`         | Predict probability estimates for input samples.                                                                                                      |
| `score(X, y)`              | Returns the mean accuracy on the given test data and labels.                                                                                           |
| `get_params()`             | Get parameters for this estimator.                                                                                                                     |
| `set_params(**params)`     | Set the parameters of this estimator.                                                                                                                  |


# XXXXXXXX regression - Example

## Data loading

In [1]:
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.semi_supervised import LabelPropagation
from sklearn.metrics import accuracy_score, classification_report

# Step 1: Data Import
digits = datasets.load_digits()
X = digits.data
y = digits.target

# Step 2: Data Processing
# Standardizing the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Create a partially labeled dataset
rng = np.random.RandomState(42)
random_unlabeled_points = rng.rand(len(y)) < 0.7
y[random_unlabeled_points] = -1  # Label some points as -1 (unlabeled)

# Step 3: Model Definition and Training
lp_model = LabelPropagation()
lp_model.fit(X_scaled, y)

# Step 4: Model Evaluation
y_pred = lp_model.transduction_
y_true = digits.target

# Evaluating the model
accuracy = accuracy_score(y_true, y_pred)
report = classification_report(y_true, y_pred)

print(f"Accuracy: {accuracy * 100:.2f}%")
print("Classification Report:")
print(report)


Accuracy: 29.55%
Classification Report:
              precision    recall  f1-score   support

          -1       0.00      0.00      0.00      1266
           0       0.20      1.00      0.33        42
           1       0.25      1.00      0.40        46
           2       0.34      1.00      0.51        58
           3       0.28      1.00      0.44        49
           4       0.32      1.00      0.48        53
           5       0.35      1.00      0.51        62
           6       0.27      1.00      0.42        49
           7       0.35      1.00      0.52        61
           8       0.32      1.00      0.48        54
           9       0.31      1.00      0.47        57

    accuracy                           0.30      1797
   macro avg       0.27      0.91      0.42      1797
weighted avg       0.09      0.30      0.14      1797



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
