### **📝 Summary of the Experiment**  

This experiment demonstrates **semi-supervised learning** using **XGBoost** to progressively label unlabeled data with high confidence.  

#### **🔹 Key Steps:**  
1. **Dataset Creation:**  
   - A synthetic multi-class dataset (**10,000 samples, 10 features, 4 classes**) is generated.  
   - The dataset is split into **training (80%)** and **testing (20%)** sets.  

2. **Initial Labeled & Unlabeled Data Split:**  
   - Only the first **400 training samples are treated as labeled**, while the remaining **7,600 are treated as unlabeled** (simulating real-world scenarios with limited labeled data).  

3. **Iterative Labeling Process:**  
   - A model is trained on the **labeled dataset**.  
   - Predictions are made on **unlabeled samples**.  
   - **Only high-confidence predictions (>90%)** are added to the labeled dataset.  
   - The process repeats until no more high-confidence samples remain.  

4. **Final Model Training & Evaluation:**  
   - The final model is trained on the **expanded labeled dataset**.  
   - Accuracy is evaluated on the **test set**.  

#### **🔹 Key Findings:**  
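- **Self-labeling expanded the training set** beyond the 400 manually labeled samples without further manual labeling.  
- The model is **retrained each round on its own confident predictions**, and the final model reaches **74.30% test accuracy**. The notebook does not train a baseline on the 400 labeled samples alone, so the gain from self-labeling is not measured here.  
- The approach is useful when **labeling data is expensive or time-consuming**.  
- **Limitations:** If the model is rarely confident, many unlabeled samples remain unused, and any confident but wrong pseudo-labels are fed back into training.  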

✅ **Conclusion:**  
Confidence-based self-labeling is a simple way to use both labeled and unlabeled data while reducing manual labeling effort. Whether it improves accuracy on a given dataset should be checked against a model trained on the labeled samples alone. 🚀

In [40]:
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd 
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

In [41]:
X, y = make_classification(n_samples = 10000, n_features = 10, n_classes = 4, n_informative = 4)

In [42]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [43]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((8000, 10), (2000, 10), (8000,), (2000,))

In [44]:
X_label, y_label = X_train[:400], y_train[:400]
X_unlabel, y_unlabel = X_train[400:], y_train[400:]

X_label, y_label = pd.DataFrame(X_label), pd.DataFrame(y_label)
X_unlabel = pd.DataFrame(X_unlabel)
# y_unlabel (the held-back true labels) stays a NumPy array; it is never used for training.

In [45]:
print("Label Shape")
X_label.shape, y_label.shape

Label Shape


((400, 10), (400, 1))

In [46]:
print("Unlabel shape")
X_unlabel.shape, y_unlabel.shape

Unlabel shape


((7600, 10), (7600,))

In [53]:
while True:
    # "multi:softprob" yields per-class probabilities; the number of classes is inferred from y_label.
    model = XGBClassifier(objective = "multi:softprob", random_state = 42)
    model.fit(X_label, y_label.values.ravel()) # ravel() flattens the (n, 1) label frame into a 1D array

    # Since X_unlabel has no labels, we use predict_proba() to get probability scores for each class.
    y_pred_probs = model.predict_proba(X_unlabel)
    
    # We filter only those samples where the highest predicted class probability is above 90%.
    confident_indexes = np.where(y_pred_probs.max(axis = 1) > 0.90)[0]
    
    # If there are no confident predictions left, we exit the loop.
    if not confident_indexes.size:
        break
        
    # Append high-confidence samples to labeled dataset
    X_label = pd.concat([X_label, X_unlabel.iloc[confident_indexes]])
    y_label = pd.concat([y_label, pd.DataFrame(y_pred_probs[confident_indexes].argmax(axis = 1))])

    # Drop used samples from unlabeled dataset and reset index
    X_unlabel.drop(confident_indexes, inplace = True)
    X_unlabel.reset_index(drop = True, inplace = True)

## Logic for above code
1. Train the model on the currently labeled dataset (`X_label`, `y_label`).
2. Predict probabilities for all remaining unlabeled samples (`X_unlabel`).
3. Find samples where the model is highly confident (probability > 90%).
4. Add these samples to the labeled dataset (`X_label`, `y_label`).
5. Remove these samples from the unlabeled dataset (`X_unlabel`).
6. Repeat the process until no more high-confidence samples remain.
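
To make steps 3 and 4 concrete, here is a minimal sketch of the confidence filter on a hand-written probability matrix. The numbers are illustrative assumptions, not output from the notebook's model. `coverage` is a hypothetical helper value showing what share of the unlabeled rows pass the threshold.

```python
import numpy as np

# Hypothetical class probabilities for 4 unlabeled rows and 4 classes
# (illustrative values only, not produced by the notebook's model).
y_pred_probs = np.array([
    [0.95, 0.02, 0.02, 0.01],   # confident -> pseudo-label 0
    [0.40, 0.30, 0.20, 0.10],   # not confident, stays unlabeled
    [0.03, 0.02, 0.93, 0.02],   # confident -> pseudo-label 2
    [0.90, 0.05, 0.03, 0.02],   # exactly 0.90: excluded by the strict ">"
])

# Same filter as in the loop: keep rows whose top probability exceeds 0.90.
confident_indexes = np.where(y_pred_probs.max(axis=1) > 0.90)[0]

# The pseudo-label is the column with the highest probability.
pseudo_labels = y_pred_probs[confident_indexes].argmax(axis=1)

# Share of unlabeled rows that would be moved to the labeled set this round.
coverage = confident_indexes.size / len(y_pred_probs)

print(confident_indexes)  # [0 2]
print(pseudo_labels)      # [0 2]
print(coverage)           # 0.5
```

Taking the `argmax` column as the label only works because the classes here are the integers 0 to 3, which is the case for `make_classification`.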

In [54]:
final_model = XGBClassifier(objective = "multi:softmax")  # the number of classes is inferred from the labels
final_model.fit(X_label, y_label.values.ravel())
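
The loop above does not record which unlabeled rows it pseudo-labeled, so the held-back `y_unlabel` cannot be used to check how many pseudo-labels were right. The following is a sketch of one way to track this, not the notebook's method. `self_train_tracked`, `make_model`, `max_rounds` and `ToyCentroidClassifier` are hypothetical names introduced here; the toy classifier exists only so the example runs without XGBoost.

```python
import numpy as np

def self_train_tracked(make_model, X_lab, y_lab, X_unlab, threshold=0.90, max_rounds=50):
    """Self-labeling loop that records which unlabeled rows were pseudo-labeled.

    make_model is a hypothetical zero-argument factory returning a fresh classifier
    with fit(X, y) and predict_proba(X). Labels are assumed to be 0..k-1 so that
    the argmax column equals the class label, as in the notebook.
    """
    X_lab, X_unlab = np.asarray(X_lab), np.asarray(X_unlab)
    y_lab = np.asarray(y_lab).ravel()
    remaining = np.arange(len(X_unlab))   # original positions still unlabeled
    picked, labels = [], []

    for _ in range(max_rounds):
        if remaining.size == 0:
            break
        model = make_model()
        model.fit(X_lab, y_lab)
        probs = model.predict_proba(X_unlab[remaining])
        confident = probs.max(axis=1) > threshold
        if not confident.any():
            break
        new_labels = probs[confident].argmax(axis=1)
        X_lab = np.vstack([X_lab, X_unlab[remaining[confident]]])
        y_lab = np.concatenate([y_lab, new_labels])
        picked.append(remaining[confident])
        labels.append(new_labels)
        remaining = remaining[~confident]

    picked = np.concatenate(picked) if picked else np.array([], dtype=int)
    labels = np.concatenate(labels) if labels else np.array([], dtype=int)
    return X_lab, y_lab, picked, labels


class ToyCentroidClassifier:
    """Tiny stand-in model so the sketch runs without XGBoost (not from the notebook)."""
    def fit(self, X, y):
        self.centroids_ = np.array([X[y == k].mean(axis=0) for k in np.unique(y)])
        return self

    def predict_proba(self, X):
        sq_dist = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        scores = np.exp(-sq_dist)
        return scores / scores.sum(axis=1, keepdims=True)


# Toy data: two seed points, three pool points (the last one is ambiguous).
X_seed = np.array([[0.0], [10.0]])
y_seed = np.array([0, 1])
X_pool = np.array([[0.5], [9.5], [5.0]])
y_pool_true = np.array([0, 1, 1])   # held-back labels, used only for the audit

X_aug, y_aug, picked, pseudo = self_train_tracked(
    ToyCentroidClassifier, X_seed, y_seed, X_pool
)
pseudo_label_accuracy = (pseudo == y_pool_true[picked]).mean()
print(picked, pseudo, pseudo_label_accuracy)
```

In the notebook, the factory could plausibly be `lambda: XGBClassifier(objective="multi:softprob", random_state=42)`, called with the original 400 labeled rows and the 7,600 unlabeled rows, and the audit would compare `pseudo` against `y_unlabel[picked]`.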

In [55]:
y_pred_final = final_model.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred_final)
print(f"Final Accuracy: {final_accuracy:.4f}")

Final Accuracy: 0.7430
