Using Clustering for Semi-Supervised Learning

How semi-supervised learning works

Imagine you have collected a large set of unlabeled data that you want to train a model on. Labeling all of it by hand would probably cost a fortune and take months to complete. That’s when semi-supervised learning comes to the rescue.

Instead of tagging the entire dataset, you hand-label just a small part of the data and use it to train a model, which is then applied to the ocean of unlabeled data.

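Self-training

One of the simplest examples of semi-supervised learning is self-training.

Self-training is a procedure in which you take any supervised method for classification or regression and modify it to work in a semi-supervised manner, taking advantage of both labeled and unlabeled data. The model is first trained on the labeled data, then predicts pseudo-labels for the unlabeled data, and is retrained with its most confident predictions added to the training set.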
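The self-training loop can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes the scikit-learn digits dataset, and the confidence threshold, the number of rounds, `random_state` and `max_iter` are all hypothetical choices made for the example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumption: the scikit-learn digits dataset, scaled to [0, 1]
X, y = load_digits(return_X_y=True)
X = X / 16.0
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_labeled = 50
X_lab, y_lab = X_train[:n_labeled], y_train[:n_labeled]
X_unlab = X_train[n_labeled:]  # the labels of these rows are never used

threshold = 0.95  # hypothetical confidence cut-off
model = LogisticRegression(max_iter=5000).fit(X_lab, y_lab)

for _ in range(5):  # hypothetical number of self-training rounds
    if len(X_unlab) == 0:
        break
    proba = model.predict_proba(X_unlab)
    confident = proba.max(axis=1) >= threshold
    if not confident.any():
        break
    # Pseudo-labels are the model's own predictions
    pseudo = model.classes_[proba.argmax(axis=1)]
    # Move confidently predicted rows into the labeled pool
    X_lab = np.vstack([X_lab, X_unlab[confident]])
    y_lab = np.concatenate([y_lab, pseudo[confident]])
    X_unlab = X_unlab[~confident]
    # Retrain on the enlarged pool
    model = LogisticRegression(max_iter=5000).fit(X_lab, y_lab)

self_training_accuracy = model.score(X_test, y_test)
print(self_training_accuracy)
```

Wrong pseudo-labels are never corrected in this loop, so a high threshold trades coverage for fewer errors being fed back into training.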

Co-training

Co-training is an extension of self-training and is likewise used when only a small portion of the data is labeled. Unlike self-training, which relies on a single classifier, co-training trains two separate classifiers on two views of the data.

The views are different sets of features that provide complementary information about each instance; ideally they are conditionally independent given the class. Each view should also be sufficient: the class of a sample can be accurately predicted from either set of features alone.

The original co-training research paper claims that the approach can be used successfully for tasks such as web content classification. The description of each web page can be divided into two views: one with the words occurring on that page and the other with the anchor words in links leading to it.
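The idea can be sketched on image data as well. The example below is only an illustration under loud assumptions: it uses the scikit-learn digits dataset and treats the top and bottom halves of each 8x8 image as the two views, which does not truly satisfy the independence requirement. The names `view_a`, `view_b`, `fit_view` and `per_round` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X = X / 16.0
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hypothetical views: top half and bottom half of each 8x8 image
view_a, view_b = slice(0, 32), slice(32, 64)

n_labeled = 50
labels = np.full(len(X_train), -1)  # -1 marks "unlabeled"
labels[:n_labeled] = y_train[:n_labeled]

def fit_view(view):
    """Train a classifier on one view, using the currently labeled rows."""
    mask = labels != -1
    clf = LogisticRegression(max_iter=5000)
    return clf.fit(X_train[mask][:, view], labels[mask])

per_round = 20  # hypothetical: rows each classifier labels per round
for _ in range(5):
    clf_a, clf_b = fit_view(view_a), fit_view(view_b)
    for clf, view in ((clf_a, view_a), (clf_b, view_b)):
        unlabeled = np.flatnonzero(labels == -1)
        if len(unlabeled) == 0:
            break
        proba = clf.predict_proba(X_train[unlabeled][:, view])
        # Each classifier adds its most confident predictions to the shared pool
        top = np.argsort(proba.max(axis=1))[-per_round:]
        labels[unlabeled[top]] = clf.classes_[proba[top].argmax(axis=1)]

# Final models, combined by summing class probabilities from both views
clf_a, clf_b = fit_view(view_a), fit_view(view_b)
combined = (clf_a.predict_proba(X_test[:, view_a])
            + clf_b.predict_proba(X_test[:, view_b]))
co_training_accuracy = np.mean(clf_a.classes_[combined.argmax(axis=1)] == y_test)
print(co_training_accuracy)
```

Because both classifiers are trained on the same labeled rows, they share the same class list, which is what makes summing their probability arrays valid here.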

Source: https://www.altexsoft.com/blog/semi-supervised-learning/

Figure (co-training method): https://content.altexsoft.com/media/2022/03/semi-supervised-co-training-method.png.webp




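In [None]:
import matplotlib.pyplot as plt
import tensorflow as tf

# Load MNIST and scale pixel values to the range [0, 1]
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Print the shape of the test set and display one training image
number = 0
print('shape :', x_test.shape)
plt.imshow(x_train[number])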
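
The cells that follow use `X_train`, `X_test`, `y_train` and `y_test` (capital `X`), which the MNIST cell does not define. Their outputs (64 features per image, pixel values from 0 to 16, indices below 1347) suggest the scikit-learn digits dataset with a default train/test split. A minimal sketch of such a loading step, under that assumption; the original split's random seed is unknown, so the exact numbers below will not be reproduced:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Assumption: 8x8 digits dataset (1797 images, 64 features each)
X_digits, y_digits = load_digits(return_X_y=True)

# Default split keeps 75% for training and 25% for testing
X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits)

print(X_train.shape, X_test.shape)
```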

In [4]:
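from sklearn.linear_model import LogisticRegression

# Baseline: train on only the first 50 labeled instances
n_labeled = 50
log_reg = LogisticRegression()
log_reg.fit(X_train[:n_labeled], y_train[:n_labeled])
log_reg.score(X_test, y_test)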

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.8066666666666666

The accuracy is only about 80.7%: it should come as no surprise that this is much lower than
what a model trained on the full training set would achieve. Let’s see how we can do
better. First, let’s cluster the training set into 50 clusters, then for each cluster let’s find
the image closest to the centroid.

In [5]:
y_train[:n_labeled]

array([7, 2, 9, 8, 2, 3, 6, 0, 5, 0, 5, 8, 1, 0, 9, 3, 3, 0, 3, 7, 0, 5,
       0, 9, 1, 4, 7, 8, 8, 6, 8, 2, 4, 4, 5, 4, 1, 8, 5, 2, 5, 1, 0, 3,
       4, 0, 7, 4, 4, 2])

fit_transform fits the clusters and returns the distance from each data point to every cluster centroid.

In [6]:
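import numpy as np
from sklearn.cluster import KMeans

k = 50
kmeans = KMeans(n_clusters=k)
# fit_transform returns an (n_samples, k) array of distances to each centroid
X_digits_dist = kmeans.fit_transform(X_train)
# For each cluster (column), pick the training image closest to its centroid
representative_digit_idx = np.argmin(X_digits_dist, axis=0)
X_representative_digits = X_train[representative_digit_idx]

# Hand-entered labels, one per representative image (must match the images selected by this KMeans run)
y_representative_digits = np.array([4,8,0,6,8,3,7,7,9,2,5,5,8,5,2,1,2,9,6,1,1,6,9,0,8,3,0,7,4,1,6,5,2,4,1,8,6,3,9,2,4,2,9,4,7,6,2,3,1,1])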
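
Labeling the representative images by hand requires looking at them first. The sketch below shows one way to display the 50 representatives in a grid; it assumes the scikit-learn digits dataset (8x8 images), and the `random_state`, `n_init` and grid layout are illustrative choices rather than settings from this notebook.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

X, _ = load_digits(return_X_y=True)

k = 50
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
dist = kmeans.fit_transform(X)      # shape (n_samples, k)
rep_idx = np.argmin(dist, axis=0)   # one image index per cluster

# Show the representatives in a 5 x 10 grid, in cluster order
fig, axes = plt.subplots(5, 10, figsize=(10, 5))
for ax, idx in zip(axes.ravel(), rep_idx):
    ax.imshow(X[idx].reshape(8, 8), cmap="binary")
    ax.axis("off")
```

Reading the grid row by row gives the order in which the hand labels have to be typed, since position j corresponds to cluster j.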



array([3, 8, 0, 3, 6, 4, 2, 5, 7, 7, 4, 5, 1, 0, 7, 3, 6, 9, 9, 8, 1, 1,
       2, 7, 4, 2, 2, 8, 9, 9, 6, 9, 0, 8, 5, 7, 4, 3, 1, 7, 6, 7, 5, 1,
       4, 7, 3, 4, 4, 9])

In [21]:
print(representative_digit_idx)


[  54  765   66  291  562  848 1337    8  355  472 1300 1321  838  174
  800 1037 1141  885  954  725 1215  204  438  602 1081 1127  209 1232
  745  193 1270  336  915  369 1137 1252 1190  532  810 1070 1269  748
  365 1310  734  717  360  429 1338  951]


array([[ 0.,  0.,  2., 11., 16., 16., 16.,  4.,  0.,  0.,  5., 11.,  8.,
         8., 16.,  1.,  0.,  0.,  0.,  0.,  0., 14.,  6.,  0.,  0.,  0.,
         2., 10., 13., 16., 13.,  0.,  0.,  0., 12., 16., 16.,  9.,  2.,
         0.,  0.,  0.,  2.,  5., 14.,  0.,  0.,  0.,  0.,  0.,  0., 11.,
         9.,  0.,  0.,  0.,  0.,  0.,  0., 16.,  6.,  0.,  0.,  0.]])

In [27]:
log_reg = LogisticRegression()
log_reg.fit(X_representative_digits, y_representative_digits)
log_reg.score(X_test, y_test)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.1111111111111111
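
A score of about 0.11 is close to chance level for ten classes, which suggests that the hand-entered labels may not line up with the representative images selected in this run (KMeans is randomly initialized, so the representatives change from run to run). A minimal sketch of the same step, assuming the scikit-learn digits dataset and using the true labels `y_train[rep_idx]` as a stand-in for labeling the 50 images by hand; `random_state`, `n_init` and `max_iter` are assumptions, not settings from this notebook:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

k = 50
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
X_dist = kmeans.fit_transform(X_train)
rep_idx = np.argmin(X_dist, axis=0)

X_rep = X_train[rep_idx]
# Stand-in for manual labeling: look up the true label of each representative
y_rep = y_train[rep_idx]

# Higher max_iter to avoid the convergence warning on unscaled pixels
log_reg = LogisticRegression(max_iter=10000)
log_reg.fit(X_rep, y_rep)
rep_score = log_reg.score(X_test, y_test)
print(rep_score)
```

Fixing `random_state` keeps the representatives stable between runs, so labels entered by hand once stay valid when the notebook is re-executed.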