# Semi Supervised (Sklearn) Demo

Developed by Jhonnatan Torres on Jan 12th, 2022

Version 1
___

**Goal**: Compare the performance of different *semi supervised* estimators available in [**scikit-learn**](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.semi_supervised)

## Data Preparation

Importing required libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import LabelPropagation, LabelSpreading, SelfTrainingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, accuracy_score

Using the *breast cancer* dataset for this demo

In [2]:
X, y = load_breast_cancer(as_frame=True, return_X_y=True)

In [3]:
X.describe()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,0.405172,1.216853,2.866059,40.337079,0.007041,0.025478,0.031894,0.011796,0.020542,0.003795,16.26919,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,0.277313,0.551648,2.021855,45.491006,0.003003,0.017908,0.030186,0.00617,0.008266,0.002646,4.833242,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,0.1115,0.3602,0.757,6.802,0.001713,0.002252,0.0,0.0,0.007882,0.000895,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504
25%,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,0.2324,0.8339,1.606,17.85,0.005169,0.01308,0.01509,0.007638,0.01516,0.002248,13.01,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146
50%,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,0.3242,1.108,2.287,24.53,0.00638,0.02045,0.02589,0.01093,0.01873,0.003187,14.97,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004
75%,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,0.4789,1.474,3.357,45.19,0.008146,0.03245,0.04205,0.01471,0.02348,0.004558,18.79,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208
max,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,2.873,4.885,21.98,542.2,0.03113,0.1354,0.396,0.05279,0.07895,0.02984,36.04,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075


In [4]:
y.value_counts()

1    357
0    212
Name: target, dtype: int64

In [5]:
y.value_counts(normalize=True)

1    0.627417
0    0.372583
Name: target, dtype: float64

In [6]:
ss_index = X.sample(frac=0.5, random_state=1234).index

In [7]:
y_ss = y.copy()

To use the *semi supervised* estimators, unknown labels must be encoded as **-1**. In this demo, that value is assigned to **50%** of the records.

In [8]:
y_ss.loc[ss_index] = -1

In [9]:
## The goal is to predict the labels with a "-1" value
y_ss.values

array([-1,  0,  0,  0,  0,  0, -1, -1,  0,  0,  0, -1,  0, -1,  0,  0,  0,
       -1,  0,  1, -1, -1, -1,  0, -1, -1,  0, -1,  0, -1,  0,  0,  0, -1,
       -1,  0, -1, -1,  0,  0, -1,  0,  0, -1, -1,  0,  1, -1,  1,  1, -1,
       -1,  1,  0, -1, -1,  0, -1, -1, -1,  1,  1,  0,  1, -1,  0, -1, -1,
        1, -1, -1, -1,  0, -1, -1,  0,  1, -1, -1,  1,  1,  1,  0,  0,  1,
       -1,  0,  0,  1, -1, -1, -1, -1, -1,  0, -1,  1, -1,  1, -1, -1, -1,
       -1,  1, -1,  0,  1,  1, -1, -1, -1, -1, -1, -1,  1, -1,  1,  0, -1,
        0,  1,  0, -1, -1, -1, -1,  0,  0, -1, -1,  1, -1,  0, -1, -1,  0,
        1,  1, -1,  1, -1, -1,  1, -1,  1,  1, -1,  1, -1, -1,  1,  1, -1,
        1,  1,  1, -1,  1,  1, -1, -1,  0, -1, -1,  0,  1, -1, -1, -1, -1,
        1, -1,  0, -1,  1,  1, -1,  0,  1,  1,  0, -1,  0,  1,  0, -1,  0,
       -1,  1, -1,  0,  1, -1,  0, -1,  1,  0,  0, -1,  0, -1,  0, -1,  0,
        1, -1, -1, -1,  1,  1,  0,  1,  0, -1, -1,  0,  1,  1, -1, -1, -1,
       -1, -1, -1,  1, -1

In [10]:
y_ss.value_counts()

-1    284
 1    176
 0    109
Name: target, dtype: int64
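The masking step above can be wrapped in a small helper. This is a minimal sketch on toy data; `mask_labels`, `toy_y` and the other names are hypothetical and not part of the original notebook.

```python
import pandas as pd


def mask_labels(y, frac, seed):
    """Return a copy of y with a random fraction of labels set to -1,
    together with the index of the masked rows."""
    masked_index = y.sample(frac=frac, random_state=seed).index
    y_masked = y.copy()              # keep the original labels for evaluation
    y_masked.loc[masked_index] = -1  # -1 marks a label as unknown
    return y_masked, masked_index


# Toy labels standing in for the real target (assumption for illustration)
toy_y = pd.Series([0, 1, 1, 0, 1, 1, 0, 1, 1, 1], name="target")
toy_y_ss, toy_masked = mask_labels(toy_y, frac=0.5, seed=1234)

print(toy_y_ss.value_counts())
```

Working on a copy matters here: the untouched labels are needed later to score the predictions made for the "-1" rows.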

The index of the unknown labels is used to split the data into train and test datasets; the goal is to compare the predictions against the actual values

In [11]:
X_train = X.drop(index=ss_index)
X_test = X.loc[ss_index]

y_train = y.drop(index=ss_index)
y_test = y.loc[ss_index]

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(285, 30) (284, 30) (285,) (284,)
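A quick sanity check for this kind of index-based split is that the two parts are disjoint and together cover every record. The sketch below uses toy data; all names in it are hypothetical.

```python
import pandas as pd

# Toy frame and labels standing in for X and y (assumption for illustration)
toy_X = pd.DataFrame({"feature": [0.1, 0.4, 0.35, 0.8, 0.9, 0.2]})
toy_y = pd.Series([0, 0, 0, 1, 1, 0], name="target")

# Rows whose labels would be hidden from the semi supervised estimators
held_out = toy_X.sample(frac=0.5, random_state=1234).index

toy_X_train, toy_X_test = toy_X.drop(index=held_out), toy_X.loc[held_out]
toy_y_train, toy_y_test = toy_y.drop(index=held_out), toy_y.loc[held_out]

# The two parts share no rows and together contain every row
no_overlap = toy_X_train.index.intersection(toy_X_test.index).empty
full_cover = len(toy_X_train) + len(toy_X_test) == len(toy_X)
print(no_overlap, full_cover)
```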


In [12]:
X_train.head(10)

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678
5,12.45,15.7,82.57,477.1,0.1278,0.17,0.1578,0.08089,0.2087,0.07613,0.3345,0.8902,2.217,27.19,0.00751,0.03345,0.03672,0.01137,0.02165,0.005082,15.47,23.75,103.4,741.6,0.1791,0.5249,0.5355,0.1741,0.3985,0.1244
8,13.0,21.82,87.5,519.8,0.1273,0.1932,0.1859,0.09353,0.235,0.07389,0.3063,1.002,2.406,24.32,0.005731,0.03502,0.03553,0.01226,0.02143,0.003749,15.49,30.73,106.2,739.3,0.1703,0.5401,0.539,0.206,0.4378,0.1072
9,12.46,24.04,83.97,475.9,0.1186,0.2396,0.2273,0.08543,0.203,0.08243,0.2976,1.599,2.039,23.94,0.007149,0.07217,0.07743,0.01432,0.01789,0.01008,15.09,40.68,97.65,711.4,0.1853,1.058,1.105,0.221,0.4366,0.2075
10,16.02,23.24,102.7,797.8,0.08206,0.06669,0.03299,0.03323,0.1528,0.05697,0.3795,1.187,2.466,40.51,0.004029,0.009269,0.01101,0.007591,0.0146,0.003042,19.19,33.88,123.8,1150.0,0.1181,0.1551,0.1459,0.09975,0.2948,0.08452
12,19.17,24.8,132.4,1123.0,0.0974,0.2458,0.2065,0.1118,0.2397,0.078,0.9555,3.568,11.07,116.2,0.003139,0.08297,0.0889,0.0409,0.04484,0.01284,20.96,29.94,151.7,1332.0,0.1037,0.3903,0.3639,0.1767,0.3176,0.1023
14,13.73,22.61,93.6,578.3,0.1131,0.2293,0.2128,0.08025,0.2069,0.07682,0.2121,1.169,2.061,19.21,0.006429,0.05936,0.05501,0.01628,0.01961,0.008093,15.03,32.01,108.8,697.7,0.1651,0.7725,0.6943,0.2208,0.3596,0.1431


In [13]:
X_test.head()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
531,11.67,20.02,75.21,416.2,0.1016,0.09453,0.042,0.02157,0.1859,0.06461,0.2067,0.8745,1.393,15.34,0.005251,0.01727,0.0184,0.005298,0.01449,0.002671,13.35,28.81,87.0,550.6,0.155,0.2964,0.2758,0.0812,0.3206,0.0895
166,10.8,9.71,68.77,357.6,0.09594,0.05736,0.02531,0.01698,0.1381,0.064,0.1728,0.4064,1.126,11.48,0.007809,0.009816,0.01099,0.005344,0.01254,0.00212,11.6,12.02,73.66,414.0,0.1436,0.1257,0.1047,0.04603,0.209,0.07699
485,12.45,16.41,82.85,476.7,0.09514,0.1511,0.1544,0.04846,0.2082,0.07325,0.3921,1.207,5.004,30.19,0.007234,0.07471,0.1114,0.02721,0.03232,0.009627,13.78,21.03,97.82,580.6,0.1175,0.4061,0.4896,0.1342,0.3231,0.1034
66,9.465,21.01,60.11,269.4,0.1044,0.07773,0.02172,0.01504,0.1717,0.06899,0.2351,2.011,1.66,14.2,0.01052,0.01755,0.01714,0.009333,0.02279,0.004237,10.41,31.56,67.03,330.7,0.1548,0.1664,0.09412,0.06517,0.2878,0.09211
220,13.65,13.16,87.88,568.9,0.09646,0.08711,0.03888,0.02563,0.136,0.06344,0.2102,0.4336,1.391,17.4,0.004133,0.01695,0.01652,0.006659,0.01371,0.002735,15.34,16.35,99.71,706.2,0.1311,0.2474,0.1759,0.08056,0.238,0.08718


In [14]:
y_train.value_counts()

1    176
0    109
Name: target, dtype: int64

In [15]:
y_test.value_counts()

1    181
0    103
Name: target, dtype: int64

## Logistic Regression as a baseline

In [16]:
lr_pl = make_pipeline(StandardScaler(), 
                      LogisticRegression(C=0.1, class_weight='balanced', random_state=1234))

In [17]:
lr_pl.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression',
                 LogisticRegression(C=0.1, class_weight='balanced',
                                    random_state=1234))])

In [18]:
lr_pl_preds = lr_pl.predict(X_test)

## Label Propagation

In [19]:
lp_pl = make_pipeline(StandardScaler(), 
                      LabelPropagation(kernel='knn', n_neighbors=13))

Note that the entire X is used to fit the estimator. To calculate the accuracy, the index of the "-1" labels is used to compare the actual values against the predictions

In [20]:
lp_pl.fit(X, y_ss)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('labelpropagation',
                 LabelPropagation(kernel='knn', n_neighbors=13))])

In [21]:
lp_pl_preds = lp_pl.predict(X)
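The "fit on everything, score only the -1 rows" approach can be sketched as below. Two things here are my additions, not the notebook's: a boolean mask is used instead of `ss_index` (it does not rely on X having a default integer index), and the fitted `transduction_` attribute, which holds the labels assigned to each training row during `fit`, is shown next to `predict(X)`. The two are computed differently and may not agree on every row.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.semi_supervised import LabelPropagation

X, y = load_breast_cancer(as_frame=True, return_X_y=True)

# Hide half of the labels, as in the cells above
y_ss = y.copy()
y_ss.loc[X.sample(frac=0.5, random_state=1234).index] = -1

lp_pl = make_pipeline(StandardScaler(),
                      LabelPropagation(kernel="knn", n_neighbors=13))
lp_pl.fit(X, y_ss)

unlabeled = (y_ss == -1).to_numpy()      # positional boolean mask
transduced = lp_pl[-1].transduction_     # labels assigned during fit
predicted = lp_pl.predict(X)             # labels from the predict step

acc_transduced = accuracy_score(y[unlabeled], transduced[unlabeled])
acc_predicted = accuracy_score(y[unlabeled], predicted[unlabeled])
print("transduction_: {:.3f}  predict: {:.3f}".format(acc_transduced, acc_predicted))
```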

## Label Spreading

In [22]:
ls_pl = make_pipeline(StandardScaler(), 
                      LabelSpreading(kernel='knn', n_neighbors=13))

In [23]:
ls_pl.fit(X, y_ss)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('labelspreading',
                 LabelSpreading(kernel='knn', n_neighbors=13))])

In [24]:
ls_pl_preds = ls_pl.predict(X)

## Self Training Classifier

In [25]:
BASE_ESTIMATOR = LogisticRegression(C=0.1, class_weight='balanced', random_state=1234)
stc_pl = make_pipeline(StandardScaler(), 
                      SelfTrainingClassifier(base_estimator=BASE_ESTIMATOR))

In [26]:
stc_pl.fit(X, y_ss)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('selftrainingclassifier',
                 SelfTrainingClassifier(base_estimator=LogisticRegression(C=0.1,
                                                                          class_weight='balanced',
                                                                          random_state=1234)))])

In [27]:
stc_pl_preds = stc_pl.predict(X)
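To see what self training actually did with the "-1" rows, the fitted estimator can be inspected. This is a sketch, not part of the original notebook: it reads the `labeled_iter_` and `termination_condition_` attributes of the fitted `SelfTrainingClassifier`, and spells out `threshold=0.75`, which I believe is the default. The base estimator is passed positionally because, as far as I know, its keyword name differs between scikit-learn releases.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = load_breast_cancer(as_frame=True, return_X_y=True)

# Hide half of the labels, as in the cells above
y_ss = y.copy()
y_ss.loc[X.sample(frac=0.5, random_state=1234).index] = -1

base = LogisticRegression(C=0.1, class_weight="balanced", random_state=1234)
stc_pl = make_pipeline(StandardScaler(),
                       SelfTrainingClassifier(base, threshold=0.75))
stc_pl.fit(X, y_ss)

stc = stc_pl[-1]
# labeled_iter_: 0 = label given up front, >0 = pseudo-labeled in that
# iteration, -1 = never confident enough to receive a pseudo-label
n_given = int((stc.labeled_iter_ == 0).sum())
n_pseudo = int((stc.labeled_iter_ > 0).sum())
n_never = int((stc.labeled_iter_ == -1).sum())

print(stc.termination_condition_, n_given, n_pseudo, n_never)
```

Rows counted in `n_never` stayed unlabeled during training; `predict` still returns a 0 or 1 for them.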

## Final Comparison
Comparing the performance of all 4 estimators used in this demo

In [28]:
# Baseline Logistic Regression
print(classification_report(y_test, lr_pl_preds))

              precision    recall  f1-score   support

           0       1.00      0.90      0.95       103
           1       0.95      1.00      0.97       181

    accuracy                           0.96       284
   macro avg       0.97      0.95      0.96       284
weighted avg       0.97      0.96      0.96       284



In [29]:
# Label Propagation
print(classification_report(y_test, lp_pl_preds[ss_index]))

              precision    recall  f1-score   support

           0       1.00      0.83      0.90       103
           1       0.91      1.00      0.95       181

    accuracy                           0.94       284
   macro avg       0.95      0.91      0.93       284
weighted avg       0.94      0.94      0.94       284



In [30]:
# Label Spreading
print(classification_report(y_test, ls_pl_preds[ss_index]))

              precision    recall  f1-score   support

           0       1.00      0.83      0.91       103
           1       0.91      1.00      0.96       181

    accuracy                           0.94       284
   macro avg       0.96      0.92      0.93       284
weighted avg       0.95      0.94      0.94       284



In [31]:
# Self Training Classifier
print(classification_report(y_test, stc_pl_preds[ss_index]))

              precision    recall  f1-score   support

           0       0.99      0.91      0.95       103
           1       0.95      0.99      0.97       181

    accuracy                           0.96       284
   macro avg       0.97      0.95      0.96       284
weighted avg       0.97      0.96      0.96       284



In [32]:
print("Accuracy Baseline Logistic Regression: {:.3f}".format(accuracy_score(y_test, lr_pl_preds)))
print("Accuracy Label Propagation: {:.3f}".format(accuracy_score(y_test, lp_pl_preds[ss_index])))
print("Accuracy Label Spreading: {:.3f}".format(accuracy_score(y_test, ls_pl_preds[ss_index])))
print("Accuracy Self Training Classifier: {:.3f}".format(accuracy_score(y_test, stc_pl_preds[ss_index])))

Accuracy Baseline Logistic Regression: 0.965
Accuracy Label Propagation: 0.937
Accuracy Label Spreading: 0.940
Accuracy Self Training Classifier: 0.965


In [33]:
print("Mean CV Score Label Propagation: {:.3f}".format(cross_val_score(lp_pl, X, y_ss).mean()))
print("Mean CV Score Label Spreading: {:.3f}".format(cross_val_score(ls_pl, X, y_ss).mean()))
print("Mean CV Score Self Training Classifier: {:.3f}".format(cross_val_score(stc_pl, X, y_ss).mean()))

Mean CV Score Label Propagation: 0.478
Mean CV Score Label Spreading: 0.483
Mean CV Score Self Training Classifier: 0.489


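*The low mean CV scores are an artifact of the "-1" labels used in these estimators*: `cross_val_score` compares the predictions (1 or 0) against `y_ss`, where about half of the values are "-1". The estimators never predict "-1", so the accuracy in each fold cannot exceed the share of labeled records (about 50%)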
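One way around this is to run the folds by hand: fit on the partially masked labels of the training fold, then score the test fold against the true labels. The sketch below is an assumption about how that could look, not the notebook's method; it uses a NumPy random mask rather than the `ss_index` sample, and the fold settings are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.semi_supervised import LabelSpreading

X, y = load_breast_cancer(return_X_y=True)

# Mask roughly half of the labels (hypothetical mask, not ss_index)
rng = np.random.default_rng(1234)
y_ss = y.copy()
y_ss[rng.random(len(y)) < 0.5] = -1

model = make_pipeline(StandardScaler(),
                      LabelSpreading(kernel="knn", n_neighbors=13))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1234)

vs_masked, vs_true, ceilings = [], [], []
for train_idx, test_idx in cv.split(X, y):
    model.fit(X[train_idx], y_ss[train_idx])    # train on masked labels
    preds = model.predict(X[test_idx])
    vs_masked.append(accuracy_score(y_ss[test_idx], preds))  # what cross_val_score sees
    vs_true.append(accuracy_score(y[test_idx], preds))       # scored on real labels
    ceilings.append(np.mean(y_ss[test_idx] != -1))           # share of labeled rows

print("Mean score vs masked labels: {:.3f}".format(np.mean(vs_masked)))
print("Mean score vs true labels:   {:.3f}".format(np.mean(vs_true)))
```

In every fold the score against the masked labels is bounded by the share of labeled rows in that fold, which is what the `ceilings` list records.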

## Closure Comments

*   Mostly default parameters were used to fit the estimators
*   In terms of performance, the Baseline (Logistic Regression) and the Self Training Classifier had the same accuracy score (0.965)
*   The Label Propagation and Label Spreading estimators performed almost identically (accuracy of 0.937 and 0.940)
*   At least for this particular dataset, and using 50% of the records as *unknown labels (-1)*, the Label Propagation and Label Spreading estimators came within about 3 percentage points of the baseline accuracy, although their recall for class 0 was lower (0.83 vs 0.90). These estimators handled the moderate imbalance (63% - 37% ratio) present in the dataset without a *class weight* or *sample weight* parameter
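Since the classes are imbalanced, the per-class recall values in the classification reports above can be condensed into a balanced accuracy (the mean of the recalls). This is a rough sketch: the recalls are copied from the printed reports, where they are rounded to two decimals, so the results are approximate. The helper name is hypothetical.

```python
# Per-class recall (class 0, class 1) as printed in the reports above
reported_recall = {
    "Baseline Logistic Regression": (0.90, 1.00),
    "Label Propagation": (0.83, 1.00),
    "Label Spreading": (0.83, 1.00),
    "Self Training Classifier": (0.91, 0.99),
}


def balanced_accuracy(recalls):
    """Mean of the per-class recalls."""
    return sum(recalls) / len(recalls)


balanced = {name: balanced_accuracy(r) for name, r in reported_recall.items()}
for name, score in balanced.items():
    print("Balanced accuracy {}: {:.3f}".format(name, score))
```

For exact values, `sklearn.metrics.balanced_accuracy_score` could be called on `y_test` and each set of predictions instead.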




