### **Semi-Supervised Learning for Document Labeling**
**Goal:** Build a text classifier that learns from a small number of labeled documents and then improves using a larger pool of unlabeled text via self-training.

**Load or Simulate Document Dataset**

Simulate a small labeled dataset (e.g., positive/negative reviews) and a larger unlabeled set.

In [84]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.utils import shuffle
from scipy.sparse import vstack

# Labeled examples (small)
labeled = pd.DataFrame({
  'text': [
      "Great product, I love this phone!",
      "Terrible experience, waste of money",
      "Very useful and affordable",
      "Worst service ever"
  ],
  'label': [1, 0, 1, 0]  # 1 = positive, 0 = negative
})
# Unlabeled examples (larger pool)
unlabeled = pd.DataFrame({
  'text': [
      "I love this phone",
      "Really satisfied with the quality",
      "Horrible app experience",
      "It works perfectly",
      "Fantastic design and build",
  ]
})

**Vectorize Text (TF-IDF)**

In [85]:
# Combine labeled and unlabeled texts so the vectorizer sees the full vocabulary
all_text = pd.concat([labeled['text'], unlabeled['text']], axis=0)

# Fit TF-IDF on all text (transductive approach)
vectorizer = TfidfVectorizer()
vectorizer.fit(all_text)
X_labeled = vectorizer.transform(labeled['text'])
y_labeled = labeled['label']
X_unlabeled = vectorizer.transform(unlabeled['text'])
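
To see what the transductive fit changes, the sketch below compares the vocabulary learned from the labeled reviews alone with the one learned from both sets. The variable names (`labeled_texts`, `vocab_labeled`, `vocab_all`, `extra_terms`) are illustrative and not part of the notebook above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

labeled_texts = [
    "Great product, I love this phone!",
    "Terrible experience, waste of money",
    "Very useful and affordable",
    "Worst service ever",
]
unlabeled_texts = [
    "I love this phone",
    "Really satisfied with the quality",
    "Horrible app experience",
    "It works perfectly",
    "Fantastic design and build",
]

# Vocabulary learned from the labeled documents only
vocab_labeled = TfidfVectorizer().fit(labeled_texts).vocabulary_

# Vocabulary learned from labeled + unlabeled documents (transductive fit)
vocab_all = TfidfVectorizer().fit(labeled_texts + unlabeled_texts).vocabulary_

# Terms that only the transductive fit can represent
extra_terms = sorted(set(vocab_all) - set(vocab_labeled))
print(len(vocab_labeled), len(vocab_all))
print(extra_terms)
```

Terms such as "satisfied" or "horrible" occur only in the unlabeled pool, so a vectorizer fit on the labeled text alone would drop them. The initial classifier still has no labeled evidence for those terms; they can only start to matter once pseudo-labeled samples are added.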

**Train Initial Classifier on Labeled Data**

In [86]:
model = LogisticRegression()
model.fit(X_labeled, y_labeled)

**Pseudo-Label the Unlabeled Data**

In [87]:
# Predict probabilities for unlabeled data
probs = model.predict_proba(X_unlabeled)
confidence = probs.max(axis=1)
pseudo_labels = model.predict(X_unlabeled)
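# Select high-confidence pseudo-labeled samples (boolean mask over rows)
threshold = 0.9
confident_mask = confidence >= threshold
X_pseudo = X_unlabeled[confident_mask]
y_pseudo = pseudo_labels[confident_mask]
# With very little training data, few or no samples may pass the threshold
print(f"{confident_mask.sum()} of {len(confident_mask)} unlabeled samples passed the threshold")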
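
A regularized logistic regression trained on only four documents may produce probabilities close to 0.5, so it is worth checking how the threshold behaves before retraining. The following is a minimal sketch of the selection step in isolation; the helper `select_confident` and the probability values are hypothetical and are not output from the model above.

```python
import numpy as np

def select_confident(probs, threshold=0.9):
    """Return a boolean mask of confident rows and the labels of those rows."""
    probs = np.asarray(probs, dtype=float)
    confidence = probs.max(axis=1)   # highest class probability per row
    # Column index of the top class; this equals the class label only
    # when the classes are [0, 1, ...] in column order (an assumption here)
    labels = probs.argmax(axis=1)
    mask = confidence >= threshold
    return mask, labels[mask]

# Hypothetical probabilities for three documents over classes [0, 1]
example_probs = [[0.05, 0.95], [0.45, 0.55], [0.92, 0.08]]
mask, labels = select_confident(example_probs, threshold=0.9)
print(f"{mask.sum()} of {len(mask)} samples selected; labels: {labels.tolist()}")
# prints: 2 of 3 samples selected; labels: [1, 0]
```

If no sample clears the threshold, the pseudo-labeled set is empty and retraining reproduces the initial model, so printing the count is a cheap sanity check.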

**Retrain Model with Pseudo-Labeled Data**

In [88]:
X_combined = vstack([X_labeled, X_pseudo])
y_combined = pd.concat([pd.Series(y_labeled), pd.Series(y_pseudo)], ignore_index=True)
X_combined, y_combined = shuffle(X_combined, y_combined, random_state=42)

model.fit(X_combined, y_combined)
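
The cells above perform a single pseudo-labeling round. Self-training is often repeated until no further unlabeled sample clears the threshold; the sketch below shows one way to do that. The function `self_train`, its `max_rounds` parameter, and the lowered threshold of 0.55 in the usage lines are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.9, max_rounds=5):
    """Repeat pseudo-labeling until nothing confident remains or rounds run out."""
    y_lab = np.asarray(y_lab)
    remaining = np.arange(X_unlab.shape[0])  # row indices still unlabeled
    model = LogisticRegression().fit(X_lab, y_lab)
    for _ in range(max_rounds):
        if remaining.size == 0:
            break
        probs = model.predict_proba(X_unlab[remaining])
        keep = probs.max(axis=1) >= threshold
        if not keep.any():
            break  # no sample is confident enough; stop early
        new_labels = model.classes_[probs.argmax(axis=1)][keep]
        # Move the confident rows from the unlabeled pool into the training set
        X_lab = vstack([X_lab, X_unlab[remaining[keep]]]).tocsr()
        y_lab = np.concatenate([y_lab, new_labels])
        remaining = remaining[~keep]
        model = LogisticRegression().fit(X_lab, y_lab)
    return model, y_lab, remaining

labeled_texts = ["Great product, I love this phone!", "Terrible experience, waste of money",
                 "Very useful and affordable", "Worst service ever"]
labels = [1, 0, 1, 0]
unlabeled_texts = ["I love this phone", "Really satisfied with the quality",
                   "Horrible app experience", "It works perfectly",
                   "Fantastic design and build"]

tfidf = TfidfVectorizer().fit(labeled_texts + unlabeled_texts)
final_model, y_final, remaining = self_train(
    tfidf.transform(labeled_texts), labels, tfidf.transform(unlabeled_texts),
    threshold=0.55,  # lowered for this tiny demo; an assumption, not a recommendation
)
print(f"training set size: {len(y_final)}, still unlabeled: {len(remaining)}")
```

scikit-learn also provides `sklearn.semi_supervised.SelfTrainingClassifier`, which wraps a similar loop and expects unlabeled samples to be marked with the label `-1`.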

**Evaluate**

In [89]:
# If you have a test set, evaluate model performance here. For demo, print sample
# predictions:
unlabeled['pseudo_label'] = model.predict(X_unlabeled)
print(unlabeled[['text', 'pseudo_label']])

                                text  pseudo_label
0                  I love this phone             1
1  Really satisfied with the quality             1
2            Horrible app experience             0
3                 It works perfectly             1
4         Fantastic design and build             1
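
The printed predictions show what the model outputs but not whether it is correct. A minimal sketch of a held-out evaluation follows, assuming a small labeled test set is available; the test reviews and their labels here are invented for illustration only, and the names `eval_vectorizer` and `eval_model` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

train_texts = [
    "Great product, I love this phone!",
    "Terrible experience, waste of money",
    "Very useful and affordable",
    "Worst service ever",
]
train_labels = [1, 0, 1, 0]

# Hypothetical held-out reviews, invented for this sketch
test_texts = ["I love the design", "Waste of money",
              "Very useful phone", "Terrible service"]
test_labels = [1, 0, 1, 0]

# Fit the vectorizer without the test text so the evaluation stays independent
eval_vectorizer = TfidfVectorizer().fit(train_texts)
eval_model = LogisticRegression().fit(eval_vectorizer.transform(train_texts), train_labels)

predictions = eval_model.predict(eval_vectorizer.transform(test_texts))
accuracy = accuracy_score(test_labels, predictions)
print(f"Held-out accuracy: {accuracy:.2f}")
```

Running the same evaluation before and after adding pseudo-labeled samples is what would show whether self-training helped on a given dataset.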


### **📌Summary:**

*   **Objective:** Classify text documents (positive/negative sentiment) using a small labeled dataset and a larger unlabeled dataset.
*   **Approach:** Used self-training (a form of semi-supervised learning) with Logistic Regression.
*   **TF-IDF Vectorization:** Transformed text data into numeric vectors using TF-IDF.
*   **Initial Training:** Trained on a small labeled dataset (4 samples).
*   **Pseudo-Labeling:** Predicted labels for the unlabeled data and selected high-confidence predictions (≥ 90% confidence).
*   **Retraining:** Combined the original labeled and pseudo-labeled data to retrain the model. No held-out test set was used, so any accuracy gain is not measured here.
*   **Final Output:** Model predicts sentiment (positive/negative) for previously unlabeled reviews.

✅ This project demonstrates the mechanics of self-training when labeled data is limited. Confirming that it improves classification performance requires evaluation on a held-out labeled test set.