- Semi-supervised learning refers to algorithms that attempt to make use of both labeled and unlabeled training data
- This sets them apart from supervised learning algorithms, which can only learn from labeled training data
- A popular approach to semi-supervised learning is to create a graph that connects examples in the training dataset and propagate known labels through the edges of the graph to label unlabeled examples
- An example of this approach to semi-supervised learning is the label propagation algorithm for classification predictive modeling
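
To make the graph idea concrete, here is a minimal sketch that connects a handful of toy points to their nearest neighbors with scikit-learn's `kneighbors_graph`. The toy data, the choice of 2 neighbors, and the variable names are assumptions for illustration only; they are not taken from the examples in this notebook.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

# Toy data (hypothetical): two tight clusters on a line, around 0 and around 10
X_toy = np.array([[0.0], [0.5], [1.0], [10.0], [10.5], [11.0]])

# Connect every point to its 2 nearest neighbors (Euclidean distance by default)
graph = kneighbors_graph(X_toy, n_neighbors=2, mode="connectivity", include_self=False)

# Dense 0/1 adjacency matrix: row i marks the neighbors of point i
adjacency = graph.toarray()
print(adjacency.astype(int))
```

Each row of the matrix lists the edges along which a label could travel. Because the two clusters are far apart, no edge links them, so a label placed in one cluster would only spread within that cluster.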

### Label Propagation Algorithm

- Label Propagation is a semi-supervised learning algorithm
- The intuition for the algorithm is that a graph is created that connects all examples (rows) in the dataset based on their distance, such as Euclidean distance
- Propagation refers to the iterative way in which labels assigned to nodes in the graph spread along its edges to connected nodes
- This procedure is sometimes called label propagation, as it “propagates” labels from the labeled vertices (which are fixed) gradually through the edges to all the unlabeled vertices
- The process is repeated until the labels stop changing or an iteration limit is reached, which strengthens the labels assigned to unlabeled examples
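
The steps above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: the function name `propagate_labels`, the fully connected RBF-weighted graph, and the toy data are all hypothetical choices made for this sketch. The default `gamma=20.0` mirrors the default of scikit-learn's `LabelPropagation`.

```python
import numpy as np

def propagate_labels(X, y, gamma=20.0, max_iter=1000, tol=1e-6):
    """Toy label propagation (illustrative); y uses -1 for unlabeled rows."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    labeled = y != -1
    classes = np.unique(y[labeled])

    # Graph: every pair of rows is connected, weighted by an RBF of their distance
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-gamma * sq_dists)
    # Row-normalize so each row spreads its label mass over its neighbors
    T = W / W.sum(axis=1, keepdims=True)

    # Label distributions: one-hot for labeled rows, all zeros for unlabeled rows
    F = np.zeros((len(y), len(classes)))
    for j, c in enumerate(classes):
        F[y == c, j] = 1.0
    F_fixed = F[labeled].copy()

    for _ in range(max_iter):
        F_new = T @ F                        # push labels along the edges
        row_sums = F_new.sum(axis=1, keepdims=True)
        row_sums[row_sums == 0] = 1.0        # avoid dividing by zero
        F_new = F_new / row_sums             # renormalize each row
        F_new[labeled] = F_fixed             # known labels stay fixed
        converged = np.abs(F_new - F).sum() < tol
        F = F_new
        if converged:
            break
    return classes[F.argmax(axis=1)]

# Toy data (hypothetical): two small clusters with one labeled point each
X_toy = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [1.1, 1.0], [1.0, 1.1]]
y_toy = [0, -1, -1, 1, -1, -1]
labels = propagate_labels(X_toy, y_toy)
print(labels)
```

The clamping step is what keeps the labeled vertices fixed while the remaining rows gradually pick up the label of the cluster they are connected to.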

Supervised

In [1]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

# synthetic binary classification problem with two informative features
X, y = make_classification(n_samples=10000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# hold out half of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split the training half again: one part keeps its labels, the other is treated as unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# supervised baseline: fit on the labeled portion only (the unlabeled portion is not used)
model = LogisticRegression()
model.fit(X_train_lab, y_train_lab)
# evaluate on the held-out test set
yhat = model.predict(X_test)
score = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (score*100))

Accuracy: 87.880


Semi-Supervised

In [2]:
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelPropagation

# same synthetic dataset and splits as the supervised example
X, y = make_classification(n_samples=10000, n_features=2, n_informative=2, n_redundant=0, random_state=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)

# training inputs: labeled rows first, then the unlabeled rows
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
# create "no label" (-1) for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]
y_train_mixed = concatenate((y_train_lab, nolabel))

# fit on labeled and unlabeled rows together
model = LabelPropagation()
model.fit(X_train_mixed, y_train_mixed)
# evaluate on the held-out test set
yhat = model.predict(X_test)
score = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (score*100))

Accuracy: 95.700


Another approach is to take the labels that the semi-supervised model estimated for the entire training dataset (available in its `transduction_` attribute) and use them to fit a supervised learning model.

In [None]:
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelPropagation
from sklearn.linear_model import LogisticRegression

# note: 1000 samples here versus 10000 in the cells above, so scores are not directly comparable
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# training inputs: labeled rows first, then the unlabeled rows
X_train_mixed = concatenate((X_train_lab, X_test_unlab))

# mark the unlabeled rows with -1
nolabel = [-1 for _ in range(len(y_test_unlab))]
y_train_mixed = concatenate((y_train_lab, nolabel))

# step 1: let label propagation estimate a label for every training row
model = LabelPropagation()
model.fit(X_train_mixed, y_train_mixed)
tran_labels = model.transduction_

# step 2: fit a supervised model on all training rows using the estimated labels
model2 = LogisticRegression()
model2.fit(X_train_mixed, tran_labels)
yhat = model2.predict(X_test)

# evaluate on the held-out test set
score = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (score*100))
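
Since the supervised model in this approach is only as good as the estimated labels it is trained on, it can be useful to check those labels first. The sketch below is one possible sanity check, relying on the fact that in this synthetic setup the true labels of the "unlabeled" rows (`y_test_unlab`) are still available. The names `n_lab`, `inferred`, and `unlab_score` are introduced here for illustration only.

```python
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelPropagation

# same data preparation as the cell above
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
y_train_mixed = concatenate((y_train_lab, [-1] * len(y_test_unlab)))

model = LabelPropagation()
model.fit(X_train_mixed, y_train_mixed)

# transduction_ follows the row order of X_train_mixed:
# the first n_lab entries belong to labeled rows, the rest to unlabeled rows
n_lab = len(y_train_lab)
inferred = model.transduction_[n_lab:]

# compare the estimated labels with the true labels that were hidden from the model
unlab_score = accuracy_score(y_test_unlab, inferred)
print('Transduction accuracy on unlabeled rows: %.3f' % (unlab_score*100))
```

On real data the true labels of unlabeled rows are not known, so a check like this is only possible on a benchmark or on a small hand-labeled sample.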