# **SEMI-SUPERVISED LEARNING ON FASHION MNIST DATASET**

In [None]:
!pip install ipython-autotime --quiet
%load_ext autotime

The autotime extension is already loaded. To reload it, use:
  %reload_ext autotime
time: 12 s (started: 2024-04-21 14:18:26 +00:00)


# Import Necessary Libraries

In [None]:
# Import the necessary libraries
from tqdm import tqdm
import numpy as np
from tensorflow import keras
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.linear_model import LogisticRegression

time: 1.08 ms (started: 2024-04-21 14:18:57 +00:00)


# Load Fashion MNIST data

The Fashion MNIST dataset can be accessed in Python through the Keras library, which provides a simple interface to download and load it as ready-made training and test splits.

In [None]:
# Load Fashion MNIST data

(X_train, y_train), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz
time: 3.84 s (started: 2024-04-21 14:19:06 +00:00)


# Data Preprocessing

The following preprocessing steps rescale the pixel values from the range 0-255 to the range 0-1 and flatten each 28x28 image into a 784-element vector, which is the input format that scikit-learn estimators such as logistic regression and k-means expect.

In [None]:
# Preprocess data

X_train = X_train.astype('float32') / 255.0
X_test = X_test.astype('float32') / 255.0

X_train = X_train.reshape(X_train.shape[0], -1)
X_test = X_test.reshape(X_test.shape[0], -1)

time: 179 ms (started: 2024-04-21 14:19:13 +00:00)
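To make the effect of these two steps concrete, here is a minimal sketch on a small synthetic batch. The array `images` is a hypothetical stand-in for `X_train`, not the real dataset, and the batch size of 5 is arbitrary.

```python
import numpy as np

# Hypothetical stand-in for a batch of 28x28 grayscale images (uint8, values 0..255).
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

# Rescale pixel intensities from [0, 255] to [0, 1].
scaled = images.astype("float32") / 255.0

# Flatten each 28x28 image into a single 784-dimensional feature vector.
flat = scaled.reshape(scaled.shape[0], -1)

print(flat.shape, flat.dtype)
```

Each row of `flat` is one image, so the result can be passed directly to estimators that expect a 2-D feature matrix.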


# A function to display the pixel array of a particular image

In [None]:
# Recover the original 0-255 pixel array of a particular image

def get_original_array(arr):
  # Undo the [0, 1] scaling; round before casting so float error cannot truncate a value
  return np.rint(np.asarray(arr) * 255).astype(np.uint8)

def print_image(i, data="train"):
  # i is either an index into the chosen split or a flattened 784-pixel image
  if isinstance(i, (int, np.integer)):
    source = X_train if data == 'train' else X_test
    return get_original_array(source[i]).reshape(28, 28)
  return get_original_array(i).reshape(28, 28)

# example
print_image(48)

time: 21 ms (started: 2024-04-21 15:20:15 +00:00)


# Standard Logistic Regression

At first, we fit a standard logistic regression model on the Fashion MNIST dataset and observe the accuracy score.

In [None]:
# Baseline: logistic regression trained on the full labelled training set

logistic_regressor_clf = LogisticRegression(random_state=69, max_iter=10000).fit(X_train, y_train)
logistic_regressor_clf.score(X_test, y_test)

0.8442

time: 9min 57s (started: 2024-04-21 14:37:10 +00:00)


The standard logistic regression model, trained on all 60000 labelled training images, reaches a test accuracy of 0.8442. This is the fully supervised baseline against which the semi-supervised variants are compared.

# **Now, we will perform semi-supervised learning on the Fashion MNIST dataset and compare the accuracy scores.**

1. First, we will take 100 random samples from the dataset and use them to train the logistic regression model.
2. Next, we will use k-means clustering to form 100 clusters and take the data point nearest to each centroid as that cluster's representative. We will then train the logistic regression classifier on those 100 representative points.
3. Then we will propagate each representative's label to every point in its cluster and train on the whole training set with the propagated labels.
4. Finally, we will repeat the propagation, but only to 20% of the dataset (the 120 points closest to each representative).



# Perform Logistic Regression On Randomly Selected 100 Samples

In [None]:
# Randomly select 100 samples from the training data
idx = np.random.choice(X_train.shape[0], 100, replace=False)

# Get the selected samples and their corresponding labels
X_sample = X_train[idx]
y_sample = y_train[idx]

logistic_regressor_clf = LogisticRegression(random_state=69, max_iter=10000).fit(X_sample, y_sample)
logistic_regressor_clf.score(X_test, y_test)

0.6925

time: 662 ms (started: 2024-04-21 14:19:49 +00:00)


An accuracy score of 0.6925 is obtained when we train the model on 100 randomly selected samples from the dataset. This is lower than the accuracy of the standard logistic regression model (0.8442). Because the sample is drawn without a fixed seed, the score varies from run to run.
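Since `np.random.choice` is called without a seed, the 100 selected samples change on every run. A minimal sketch of a reproducible draw is shown below. The helper `sample_labelled_subset`, its `seed` parameter, and the toy arrays are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def sample_labelled_subset(X, y, n_samples=100, seed=0):
    """Draw n_samples rows without replacement, reproducibly for a given seed."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=n_samples, replace=False)
    return X[idx], y[idx], idx

# Toy data standing in for (X_train, y_train): 500 points, 4 features, 10 classes.
X_demo = np.arange(2000, dtype="float32").reshape(500, 4)
y_demo = np.arange(500) % 10

X_a, y_a, idx_a = sample_labelled_subset(X_demo, y_demo, n_samples=100, seed=0)
X_b, y_b, idx_b = sample_labelled_subset(X_demo, y_demo, n_samples=100, seed=0)

# The same seed yields the same subset.
print(np.array_equal(idx_a, idx_b))
```

Repeating the experiment over several seeds and reporting the mean accuracy would give a fairer comparison with the cluster-based selection.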

# Perform Logistic Regression On 100 Cluster Representative Points

In [None]:
# Use MiniBatchKMeans to select 100 representative samples and then train the classifier

kmeans = MiniBatchKMeans(n_clusters=100, random_state=42)
kmeans.fit(X_train)

cluster_centers = kmeans.cluster_centers_

nearest_points_to_cluster = [-1]*100
distance_from_nearest_points_to_cluster = [10e9]*100

# Select those representative points
for i, cluster_center in enumerate(cluster_centers):
  for j, data_point in enumerate(X_train):
    distance = np.linalg.norm(cluster_center-data_point)
    if distance < distance_from_nearest_points_to_cluster[i]:
      distance_from_nearest_points_to_cluster[i] = distance
      nearest_points_to_cluster[i] = j

X_sample_ = X_train[nearest_points_to_cluster]
y_sample_ = y_train[nearest_points_to_cluster]

# Train and test the logistic regression classifier
logistic_regressor_clf = LogisticRegression(random_state=69, max_iter=10000).fit(X_sample_, y_sample_)
logistic_regressor_clf.score(X_test, y_test)



0.7078

time: 49.2 s (started: 2024-04-21 14:19:54 +00:00)


We observe that the accuracy improves slightly over the random sample (0.7078 versus 0.6925) but is still well below the standard logistic regression model (0.8442).
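The nested Python loop above compares every centroid with every training point one pair at a time. The same search can be sketched in vectorized NumPy, as below. The function `nearest_point_indices` and its `chunk_size` parameter are hypothetical names introduced here. It uses squared distances computed through the expansion ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, so results may differ from the loop in near-tie cases because of floating-point rounding.

```python
import numpy as np

def nearest_point_indices(X, centers, chunk_size=10000):
    """For each center, return the index of the closest row of X (Euclidean)."""
    n_centers = centers.shape[0]
    best_dist = np.full(n_centers, np.inf)
    best_idx = np.full(n_centers, -1, dtype=np.int64)
    center_sq = (centers ** 2).sum(axis=1)[None, :]
    # Process X in chunks to keep the (chunk, n_centers) distance matrix small.
    for start in range(0, X.shape[0], chunk_size):
        chunk = X[start:start + chunk_size]
        # Squared distances between every chunk row and every center.
        d2 = (chunk ** 2).sum(axis=1)[:, None] - 2.0 * chunk @ centers.T + center_sq
        local = d2.argmin(axis=0)                      # best row per center in this chunk
        local_dist = d2[local, np.arange(n_centers)]
        better = local_dist < best_dist                # strict: earliest point wins ties
        best_dist[better] = local_dist[better]
        best_idx[better] = start + local[better]
    return best_idx

# Toy example: two well-separated groups and two centers.
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0], [5.0, 5.0]])
centers_demo = np.array([[0.4, 0.0], [10.6, 10.0]])
rep_idx = nearest_point_indices(X_demo, centers_demo, chunk_size=2)
print(rep_idx)
```

In the notebook this would be called as `nearest_point_indices(X_train, kmeans.cluster_centers_)`, assuming the fitted `kmeans` object from the cell above.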

# Perform Logistic Regression On 100 Cluster Representative Points After Propagating the Labels

In [None]:
# Use MiniBatchKMeans to select 100 representative samples, propagate the labels and then train the classifier

kmeans = MiniBatchKMeans(n_clusters=100, random_state=42)
kmeans.fit(X_train)

cluster_centers = kmeans.cluster_centers_
datapoints_labels = kmeans.labels_

# Nearest points of each cluster
nearest_points_to_cluster = [-1]*100
distance_from_nearest_points_to_cluster = [10e9]*100

# Mapping between labels and nearest points
nearest_point_of_a_label = [-1]*100

# Select those representative points
for i, cluster_center in enumerate(cluster_centers):
  for j, data_point in enumerate(X_train):
    distance = np.linalg.norm(cluster_center-data_point)
    if distance < distance_from_nearest_points_to_cluster[i]:
      distance_from_nearest_points_to_cluster[i] = distance
      nearest_points_to_cluster[i] = j
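      # Caution: the entry is keyed by the cluster of point j, not by cluster i, so a
      # provisional "nearest so far" point can replace another cluster's representative.
      nearest_point_of_a_label[datapoints_labels[j]] = j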

y_propagated = [-1]*(X_train.shape[0])
for i, datapoint in enumerate(X_train):
  cluster_label = datapoints_labels[i]
  y_propagated[i] = y_train[nearest_point_of_a_label[cluster_label]]

# Train and test the logistic regression classifier
logistic_regressor_clf = LogisticRegression(random_state=69, max_iter=10000).fit(X_train, y_propagated)
logistic_regressor_clf.score(X_test, y_test)



0.6992

time: 9min 51s (started: 2024-04-21 14:20:43 +00:00)
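Label propagation by cluster can also be written as a single indexing step. The sketch below keys the lookup by cluster index, so each cluster has exactly one representative. The function name `propagate_labels` and the toy arrays are assumptions for illustration. In the notebook, `cluster_ids` would correspond to `kmeans.labels_` and `representative_idx` to `nearest_points_to_cluster`.

```python
import numpy as np

def propagate_labels(cluster_ids, representative_idx, y):
    """Give every point the label of its cluster's representative point.

    cluster_ids[k]        : cluster index assigned to point k
    representative_idx[c] : index of the point chosen to represent cluster c
    y                     : true labels (only the representatives' labels are read)
    """
    representative_labels = np.asarray(y)[np.asarray(representative_idx)]
    return representative_labels[np.asarray(cluster_ids)]

# Toy example: 6 points in 2 clusters, represented by points 1 and 3.
cluster_ids_demo = np.array([0, 0, 1, 1, 1, 0])
representative_idx_demo = np.array([1, 3])
y_demo = np.array([7, 7, 2, 2, 9, 4])

y_propagated_demo = propagate_labels(cluster_ids_demo, representative_idx_demo, y_demo)
print(y_propagated_demo)
```

Only the labels at `representative_idx` are consulted, which mirrors the semi-supervised setting where just 100 points are labelled by hand.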


# Perform Logistic Regression On 100 Cluster Representative Points After Propagating the Labels Partially

In [None]:
# Use MiniBatchKMeans to select 100 representative samples, propagate the labels partially and then train the classifier

kmeans = MiniBatchKMeans(n_clusters=100, random_state=42)
kmeans.fit(X_train)

cluster_centers = kmeans.cluster_centers_
datapoints_labels = kmeans.labels_

# Nearest points of each cluster
nearest_points_to_cluster = [-1]*100
distance_from_nearest_points_to_cluster = [10e9]*100

# Mapping between labels and nearest points
nearest_point_of_a_label = [-1]*100

# Select those representative points
for i in tqdm(range(len(cluster_centers))):
  for j, data_point in enumerate(X_train):
    distance = np.linalg.norm(cluster_centers[i]-data_point)
    if distance < distance_from_nearest_points_to_cluster[i]:
      distance_from_nearest_points_to_cluster[i] = distance
      nearest_points_to_cluster[i] = j
      nearest_point_of_a_label[datapoints_labels[j]] = j

closest_points_from_centroid = np.array([[1]*120]*100)

for i in tqdm(range(len(nearest_points_to_cluster))):
  distance_point_pair = np.empty((0, 2))

  for j, data_point in enumerate(X_train):

    dist = np.linalg.norm(data_point-X_train[nearest_points_to_cluster[i]])
    to_append = np.array([dist, j])
    distance_point_pair = np.concatenate((distance_point_pair, [to_append]))

  sorted_indices = np.argsort(distance_point_pair[:, 0])
  distance_point_pair = (distance_point_pair[sorted_indices])[:120]

  closest_points_idx = distance_point_pair[:, 1]
  closest_points_from_centroid[i] = closest_points_idx

X_sample = []
y_propagated = []

# Each of the 120 neighbours inherits the label of its representative point
for i, neighbour_ids in enumerate(closest_points_from_centroid):
  for j in neighbour_ids:
    X_sample.append(X_train[j])
    y_propagated.append(y_train[nearest_points_to_cluster[i]])

# Train and test the logistic regression classifier
logistic_regressor_clf = LogisticRegression(random_state=37, max_iter=10000).fit(X_sample, y_propagated)
logistic_regressor_clf.score(X_test, y_test)

100%|██████████| 100/100 [00:38<00:00,  2.62it/s]
100%|██████████| 100/100 [04:37<00:00,  2.77s/it]


0.7041

time: 6min 34s (started: 2024-04-21 14:30:35 +00:00)
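Most of the running time above goes into growing `distance_point_pair` with `np.concatenate` once per training point. A vectorized sketch of the same neighbour selection is given below. As in the cell above, neighbours are drawn from the whole training set rather than only from the representative's own cluster, and 120 neighbours for each of 100 representatives gives 12000 points, which is 20% of 60000. The helper `k_nearest_to_each` and the toy data are hypothetical.

```python
import numpy as np

def k_nearest_to_each(X, anchor_idx, k):
    """For each anchor point, return the indices of its k nearest rows of X.

    The anchor itself is included (distance 0). Result shape: (len(anchor_idx), k).
    """
    neighbours = np.empty((len(anchor_idx), k), dtype=np.int64)
    for row, a in enumerate(anchor_idx):
        dists = np.linalg.norm(X - X[a], axis=1)
        # argpartition finds the k smallest distances without sorting everything.
        nearest = np.argpartition(dists, k - 1)[:k]
        # Order those k indices from closest to farthest.
        neighbours[row] = nearest[np.argsort(dists[nearest])]
    return neighbours

# Toy example: 6 one-dimensional points, representatives at indices 0 and 4.
X_demo = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.5]])
y_demo = np.array([0, 0, 0, 1, 1, 1])
anchors_demo = [0, 4]

neighbours_demo = k_nearest_to_each(X_demo, anchors_demo, k=3)

# Build the partially propagated training set: each neighbour gets its anchor's label.
X_partial = X_demo[neighbours_demo.ravel()]
y_partial = np.repeat(y_demo[anchors_demo], neighbours_demo.shape[1])
print(neighbours_demo, y_partial)
```

Note that a point close to two representatives appears twice in the resulting training set, possibly with different labels. The original loop behaves the same way.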


## Report:

- Note that the number of clusters was chosen based on the size of the dataset.
- The standard logistic regression model, trained on all 60000 labelled images, reaches an accuracy of `0.8442`.
- Next, we choose 100 data points at random (out of 60000) and train the same model on them, which gives `0.6925`. With so few labels, performance drops sharply.
- Then, instead of choosing randomly, we cluster the dataset into 100 clusters and use the points nearest to the centroids as representative points to train the logistic regression model. The accuracy is `0.7078`, a modest gain of about 1.5 percentage points over the random sample, consistent with the representatives being more typical of the data than random picks.
- Next, we propagate the labels to every point in each cluster and obtain an accuracy of `0.6992`. Performance drops slightly, likely because some points receive a wrong label from their cluster.
- Finally, instead of propagating to all the points, we propagate to only 20% of the data and obtain `0.7041`. This is better than full propagation, which suggests that restricting propagation to points close to the representatives reduces label noise, but it is still slightly below training on the 100 representatives alone (`0.7078`).
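The "spurious labelling" explanation can be checked directly, because the true training labels are available in this experiment. The following sketch measures how often a propagated label agrees with the true one. The function `propagation_agreement` is a hypothetical helper, and the toy arrays are not results from the notebook.

```python
import numpy as np

def propagation_agreement(y_propagated, y_true):
    """Fraction of points whose propagated label equals the true label."""
    y_propagated = np.asarray(y_propagated)
    y_true = np.asarray(y_true)
    return float(np.mean(y_propagated == y_true))

# Toy example: 4 of the 6 propagated labels match the true labels.
y_propagated_demo = [7, 7, 2, 2, 2, 7]
y_true_demo = [7, 7, 2, 2, 9, 4]
agreement = propagation_agreement(y_propagated_demo, y_true_demo)
print(round(agreement, 3))
```

Applied to `y_propagated` and the matching slice of `y_train`, a low agreement under full propagation and a higher one under partial propagation would support the explanation given in the report.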