# Semi-Supervised Learning With Label Propagation

In [1]:
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from numpy import concatenate
from sklearn.semi_supervised import LabelPropagation, LabelSpreading
from seaborn import catplot
%matplotlib inline

### Define dataset

To illustrate semi-supervised learning, this notebook uses a synthetic dataset. Scikit-learn's `make_classification()` creates a synthetic classification dataset. Let's create a dataset of 1000 instances with three features and three classes (a multiclass classification problem).

In [2]:
X, y = make_classification(
    n_samples=1000, n_features=3, n_classes=3, 
    n_informative=3, n_redundant=0, random_state=1)
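
As a quick sanity check on the generated data, the sketch below regenerates the same dataset and counts the instances per class. The name `class_counts` is introduced here for illustration only and is not used elsewhere in the notebook.

```python
from collections import Counter

from sklearn.datasets import make_classification

# Same call as in the cell above, repeated so this block runs on its own.
X, y = make_classification(
    n_samples=1000, n_features=3, n_classes=3,
    n_informative=3, n_redundant=0, random_state=1)

# Count how many instances fall into each class label.
class_counts = Counter(y.tolist())

print('Shapes:', X.shape, y.shape)
print('Instances per class:', sorted(class_counts.items()))
```

Since no `weights` argument is passed, the three classes should come out roughly balanced.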

### T1. Split the dataset into train and test and the train into labeled and unlabeled
We first split the dataset into train and test sets with an equal 50-50 split. We then split the training set in half again: one portion keeps its labels, and the other we pretend is unlabeled. Note that in the code the unlabeled training portion is named `X_test_unlab` / `y_test_unlab`; despite the name, it is part of the training data, not the test set.

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.50, random_state=1, stratify=y)
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(
    X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)

### T2. Summarizing the training and test size

In [4]:
print('Labeled Train Set:', X_train_lab.shape, y_train_lab.shape)
print('Unlabeled Train Set:', X_test_unlab.shape, y_test_unlab.shape)
print('Test Set:', X_test.shape, y_test.shape)

Labeled Train Set: (250, 3) (250,)
Unlabeled Train Set: (250, 3) (250,)
Test Set: (500, 3) (500,)
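
Both splits pass `stratify`, so each subset should keep roughly the same class mix as the full dataset. The sketch below checks this by computing per-class fractions with `np.bincount`; the `proportions` dictionary is an illustrative helper, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Recreate the dataset and both splits so this block runs on its own.
X, y = make_classification(
    n_samples=1000, n_features=3, n_classes=3,
    n_informative=3, n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.50, random_state=1, stratify=y)
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(
    X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)

# Fraction of each class (0, 1, 2) in every subset.
proportions = {
    'Labeled Train Set': np.bincount(y_train_lab, minlength=3) / len(y_train_lab),
    'Unlabeled Train Set': np.bincount(y_test_unlab, minlength=3) / len(y_test_unlab),
    'Test Set': np.bincount(y_test, minlength=3) / len(y_test),
}
for name, fractions in proportions.items():
    print(name + ':', np.round(fractions, 3))
```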


### T3. Report baseline performance with decision tree

We can establish a performance baseline on the semi-supervised learning dataset using a supervised learning algorithm fit only on the labeled training data.
This is important because we would expect a semi-supervised learning algorithm to outperform a supervised learning algorithm fit on the labeled data alone. If it does not, the semi-supervised algorithm adds no value over the supervised baseline.
In this case, we will use a decision tree fit on the labeled portion of the training dataset:

In [5]:
model = DecisionTreeClassifier()
model.fit(X_train_lab, y_train_lab)

DecisionTreeClassifier()

In [6]:
yhat = model.predict(X_test)
score1 = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (score1*100))

Accuracy: 77.400


Your results may vary given the stochastic nature of the algorithm or evaluation procedure (the `DecisionTreeClassifier` above is created without a fixed `random_state`), or differences in numerical precision.
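
One way to gauge how much the baseline moves from run to run is to refit it with several seeds and summarize the scores. The sketch below is an assumption-laden illustration rather than part of the original experiment: it varies only the tree's `random_state` over ten arbitrary seeds and keeps the data splits fixed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Recreate the dataset and both splits so this block runs on its own.
X, y = make_classification(
    n_samples=1000, n_features=3, n_classes=3,
    n_informative=3, n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.50, random_state=1, stratify=y)
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(
    X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)

# Refit the baseline with ten different tree seeds (an arbitrary choice).
scores = []
for seed in range(10):
    model = DecisionTreeClassifier(random_state=seed)
    model.fit(X_train_lab, y_train_lab)
    scores.append(accuracy_score(y_test, model.predict(X_test)))

print('Mean accuracy: %.3f' % (np.mean(scores) * 100))
print('Std. deviation: %.3f' % (np.std(scores) * 100))
```

Fixing `random_state` in a single run makes the reported number reproducible; averaging over seeds shows how much of a difference between models could be noise.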

### T4. Prepare the training dataset for the label spreading method

In [7]:
# create the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]
# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))

### T5. Define and fit the label spreading model

In [8]:
model = LabelSpreading(max_iter=2000)
model.fit(X_train_mixed, y_train_mixed)

LabelSpreading(max_iter=2000)
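
The `-1` entries act only as a marker for "unlabeled" and are not treated as a class of their own. The sketch below refits the same model on its own and inspects the fitted attributes `classes_` and `label_distributions_` to illustrate this; it is a minimal check, not an extension of the method.

```python
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import LabelSpreading

# Recreate the dataset, the splits, and the mixed training set.
X, y = make_classification(
    n_samples=1000, n_features=3, n_classes=3,
    n_informative=3, n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.50, random_state=1, stratify=y)
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(
    X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
nolabel = [-1 for _ in range(len(y_test_unlab))]
y_train_mixed = concatenate((y_train_lab, nolabel))

model = LabelSpreading(max_iter=2000)
model.fit(X_train_mixed, y_train_mixed)

# How many inputs carried the -1 "unlabeled" marker.
print('Unlabeled inputs:', int((y_train_mixed == -1).sum()))
# The fitted classes contain only the real labels, not -1.
print('Classes:', model.classes_)
# One row of class scores per training instance, labeled or not.
print('Label distributions:', model.label_distributions_.shape)
```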

### T6. Report accuracy of label spreading method

In [9]:
yhat = model.predict(X_test)
score2 = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (score2*100))

Accuracy: 82.600


Here, the accuracy of the label spreading model (82.6%) is about 5 percentage points higher than that of the decision tree baseline (77.4%).

### T7. Fitting a supervised model using the estimated labels for the training dataset

In [10]:
# get the estimated labels for the entire training dataset
tran_labels = model.transduction_
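
Because the "unlabeled" portion actually has known labels that we held back, we can check how well the estimated labels match them. The sketch below is one way to do this, assuming the row order of `X_train_mixed` (labeled rows first, then unlabeled rows); the names `n_lab` and `unlab_pred` are introduced here for illustration.

```python
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import LabelSpreading

# Recreate the dataset, the splits, and the mixed training set.
X, y = make_classification(
    n_samples=1000, n_features=3, n_classes=3,
    n_informative=3, n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.50, random_state=1, stratify=y)
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(
    X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
nolabel = [-1 for _ in range(len(y_test_unlab))]
y_train_mixed = concatenate((y_train_lab, nolabel))

model = LabelSpreading(max_iter=2000)
model.fit(X_train_mixed, y_train_mixed)
tran_labels = model.transduction_

# The first n_lab entries belong to the labeled rows,
# the remaining entries to the rows whose labels were hidden.
n_lab = len(y_train_lab)
unlab_pred = tran_labels[n_lab:]

score = accuracy_score(y_test_unlab, unlab_pred)
print('Accuracy on hidden training labels: %.3f' % (score * 100))
```

Label spreading clamps the known labels softly (controlled by its `alpha` parameter), so the estimated labels for the labeled rows are not guaranteed to equal the original ones either.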

In [11]:
model2 = DecisionTreeClassifier()
model2.fit(X_train_mixed, tran_labels)

DecisionTreeClassifier()

### T8. Report prediction accuracy of layered models

In [12]:
yhat = model2.predict(X_test)
score3 = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (score3*100))

Accuracy: 80.600


### T9. Discuss your observations

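In this case, the label spreading model achieves the highest test accuracy (82.6%), followed by the layered model, a decision tree trained on the estimated labels (80.6%), and then the decision tree baseline trained on the labeled data alone (77.4%). Both approaches that use the unlabeled data outperform the baseline.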
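
To put the three results side by side, the short sketch below takes the accuracies printed in cells 6, 9, and 12 and computes each model's gain over the baseline in percentage points. The numbers are copied from the outputs above rather than recomputed, so a rerun of the notebook may give different values.

```python
# Accuracies (in percent) as reported in the outputs of cells 6, 9, and 12.
reported = {
    'Decision tree baseline': 77.4,
    'Label spreading': 82.6,
    'Layered model': 80.6,
}

# Gain of each model over the baseline, in percentage points.
baseline = reported['Decision tree baseline']
gains = {name: score - baseline for name, score in reported.items()}

for name, score in reported.items():
    print('%-24s %.1f  (%+.1f vs. baseline)' % (name, score, gains[name]))
```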