### A gentle introduction to semi-supervised learning with Label Spreading

A popular approach to semi-supervised learning is to build a graph that connects the examples in the training dataset and to propagate known labels along the edges of that graph, so that unlabeled examples receive labels from their neighbors. One example of this approach is the label spreading algorithm for classification predictive modeling.

> The key to semi-supervised learning problems is the prior assumption of consistency, which means: (1) nearby points are likely to have the same label; and (2) points on the same structure (typically referred to as a cluster or a manifold) are likely to have the same label. (Learning With Local And Global Consistency, 2003)

> Another similar label propagation algorithm was given by Zhou et al.: at each step a node i receives a contribution from its neighbors j (weighted by the normalized weight of the edge (i,j)), and an additional small contribution given by its initial value.

> The label of each unlabeled point is set to be the class of which it has received most information during the iteration process.
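The update described in the quotes above can be sketched in a few lines of NumPy. This is a minimal illustration of the iteration from Zhou et al., F ← αSF + (1 − α)Y with S = D^(-1/2) W D^(-1/2), and not necessarily how scikit-learn implements `LabelSpreading` internally. The function name `label_spreading_sketch` and the values chosen for `alpha`, `gamma`, and `n_iter` are illustrative assumptions, and the sketch assumes every point has at least one neighbor with non-zero affinity.

```python
import numpy as np

def label_spreading_sketch(X, y, alpha=0.2, gamma=20.0, n_iter=30):
    """Toy label spreading (hypothetical helper); y uses -1 for unlabeled points."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y[y != -1])

    # RBF affinity matrix W with a zero diagonal (no self-loops)
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-gamma * sq_dists)
    np.fill_diagonal(W, 0.0)

    # Symmetric normalization: S = D^{-1/2} W D^{-1/2}
    # (assumes no isolated points, so every degree is non-zero)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    # Initial label matrix Y: one-hot rows for labeled points, zero rows otherwise
    Y = np.zeros((len(y), len(classes)))
    for k, c in enumerate(classes):
        Y[y == c, k] = 1.0

    # Each step mixes neighbor contributions (alpha) with the initial labels (1 - alpha)
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * (S @ F) + (1 - alpha) * Y

    # Each point takes the class from which it received the most information
    return classes[F.argmax(axis=1)]

# Two well-separated clusters with one labeled point each
X_toy = [[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5], [5, 5.1]]
y_toy = [0, -1, -1, 1, -1, -1]
print(label_spreading_sketch(X_toy, y_toy))
```

In the toy example, each unlabeled point should end up with the label of the single labeled point in its own cluster, because the affinity between the two clusters is effectively zero.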

### 1. Semi-Supervised Learning: Dataset and Supervised Baseline

In [1]:
# Usual Imports

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Define Dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# Split the dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1, stratify=y)
# Split the train set into labeled and unlabeled portions
X_train_lab, X_train_unlab, y_train_lab, y_train_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)

# Summarize training set size 
print('Labeled Train Set:', X_train_lab.shape, y_train_lab.shape)
print('Unlabeled Train Set:', X_train_unlab.shape, y_train_unlab.shape)
# Summarize test set size
print('Test Set:', X_test.shape, y_test.shape)

Labeled Train Set: (250, 2) (250,)
Unlabeled Train Set: (250, 2) (250,)
Test Set: (500, 2) (500,)


In [3]:
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression().fit(X_train_lab, y_train_lab)
# Prediction on test set
y_hat = lr.predict(X_test)
# Calculate score for test set
score = accuracy_score(y_test, y_hat)
# Summarize score
print("Accuracy score: %.3f" %(score * 100))


Accuracy score: 84.800


### 2. Semi-Supervised Learning with Label Spreading

In [8]:
from numpy import concatenate
from sklearn.semi_supervised import LabelSpreading
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

# Create training dataset input
X_train_mixed = concatenate((X_train_lab, X_train_unlab))
# Create "no label" markers (-1) for the unlabeled data
nolabel = [-1 for _ in range(len(y_train_unlab))]
# Recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))

# Summarize the shape and values

In [9]:
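# Fit label spreading on the labeled and unlabeled training data
ls = LabelSpreading().fit(X_train_mixed, y_train_mixed)
# Predict on the test set
y_hat = ls.predict(X_test)
# Score and summarize
score = accuracy_score(y_test, y_hat)
print('Accuracy: %.3f' % (score * 100))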

Accuracy: 85.400


### 3. Semi-supervised with Label Spreading and Supervised Learning
- The algorithm fits the semi-supervised model on the entire training dataset,
- then fits a supervised learning model on the entire training dataset with the inferred labels,
- and finally evaluates that model on the holdout dataset, printing the classification accuracy.

In [10]:
model = LabelSpreading()
# fit model on training dataset
model.fit(X_train_mixed, y_train_mixed)
# Get the inferred labels for the entire training dataset
tran_labels = model.transduction_
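
Because the true labels of the "unlabeled" portion were held back rather than unknown, one optional sanity check is to compare them against the labels stored in `transduction_`. The sketch below is self-contained and rebuilds a smaller dataset under its own variable names (`X_lab`, `X_unlab`, `y_mixed`, `inferred`, and so on are illustrative and not part of the cells above), so the number it prints is not comparable to the accuracies reported in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import LabelSpreading

# Small synthetic dataset, split into labeled and "unlabeled" halves
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)
X_lab, X_unlab, y_lab, y_unlab = train_test_split(
    X, y, test_size=0.5, random_state=1, stratify=y)

# Labeled rows first, then unlabeled rows marked with -1
X_mixed = np.concatenate((X_lab, X_unlab))
y_mixed = np.concatenate((y_lab, np.full(len(y_unlab), -1)))

model = LabelSpreading().fit(X_mixed, y_mixed)

# transduction_ holds one label per training row, in the same order as X_mixed,
# so the tail of the array corresponds to the unlabeled rows
inferred = model.transduction_[len(y_lab):]
transductive_acc = accuracy_score(y_unlab, inferred)
print('Transductive accuracy: %.3f' % (transductive_acc * 100))
```

A low value here would suggest that the inferred labels are noisy, in which case a supervised model trained on them may not improve on the labeled-only baseline.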

In [12]:
# Define supervised learning model
lr2 = LogisticRegression()
lr2.fit(X_train_mixed, tran_labels)

In [13]:
# Make predictions on test set
y_hat = lr2.predict(X_test)
# Score for test set
score = accuracy_score(y_test, y_hat)
# Summarize Score
print('Accuracy: %.3f' %(score * 100))

Accuracy: 85.800


#### CONCLUSION

We can see that this two-stage approach, a semi-supervised model followed by a supervised model, achieves a classification accuracy of about 85.8 percent on the holdout dataset. That is slightly better than the semi-supervised learning algorithm used alone, which achieved about 85.4 percent, and better than the supervised baseline trained only on the labeled data, which achieved about 84.8 percent.