# Understanding Semi-Supervised Learning Algorithms

## Technical requirements

The code in this chapter requires the following:
- Python 3
- pip
- TensorFlow (with CUDA if you want to train models on GPUs)
    - Keras is installed as a dependency of TensorFlow
- scikit-learn Python library
    - NumPy is installed as a dependency of scikit-learn
- pandas Python library
- Jupyter Notebook, if you run the code directly from Jupyter

In [None]:
! python3 -m pip install --upgrade pip

### For M1+ MacBook (64-bit ARM-based processor)

In [None]:
! arch -arm64 pip3 install --upgrade pip
! arch -arm64 pip3 install tensorflow
! arch -arm64 pip3 install -U scikit-learn
! arch -arm64 pip3 install pandas

### For Other Computer Systems

In [None]:
! pip3 install --upgrade pip
! pip3 install tensorflow
! pip3 install -U scikit-learn
! pip3 install pandas

## 2. Pseudo-labeling: Label the unlabeled

### 2.3 Hands-on generating pseudo-labels

To illustrate pseudo-labeling, we will build a model on the CIFAR-10 dataset, which consists of 60,000 32x32 color images in 10 classes, split into 50,000 training images and 10,000 test images. To demonstrate how unlabeled data is pseudo-labeled, we will treat a small subset of the training images as labeled and the rest as unlabeled, and then follow the process discussed earlier step by step. We begin by importing the necessary modules.

In [None]:
from tensorflow.keras import models, layers, datasets, utils
import numpy as np

#### Train a Baseline Model on Small Labeled Dataset
First, we load and preprocess the CIFAR-10 dataset, just as we did in chapter 1, to obtain the training and test images and their labels, with each pixel value normalized to the range [-1, 1]. We also one-hot encode the class labels, converting each integer class value into a vector.

In [None]:
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

train_images = train_images / 127.5 - 1
test_images = test_images / 127.5 - 1

train_labels = utils.to_categorical(train_labels, 10)
test_labels = utils.to_categorical(test_labels, 10)
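
The two preprocessing steps above can be checked on a handful of values without downloading the dataset. The following is a minimal sketch, assuming toy pixel values and class ids (`pixels` and `class_ids` are made up for illustration); it uses an identity matrix as a NumPy stand-in for `utils.to_categorical`.

```python
import numpy as np

# Toy stand-in for uint8 pixel values (not real CIFAR-10 data).
pixels = np.array([0, 64, 128, 255], dtype=np.uint8)

# Same arithmetic as the preprocessing step: [0, 255] -> [-1, 1].
scaled = pixels / 127.5 - 1

# One-hot encode integer class ids by picking rows of an identity matrix.
class_ids = np.array([3, 0, 9])
one_hot = np.eye(10)[class_ids]

print(scaled.min(), scaled.max())  # -1.0 1.0
print(one_hot.shape)               # (3, 10)
```

Each row of `one_hot` contains a single 1 at the position of the class id, which is the form the categorical cross-entropy loss expects.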

Next, we treat a small subset (1,000 images) of the training data as labeled and ignore the labels of the remaining images. To ensure that all classes are represented in this subset, we pick the same number of data points (100) from each class.

In [None]:
num_labeled_per_class = 100
num_classes = 10

labeled_indices = []
for i in range(num_classes):
    indices = np.where(np.argmax(train_labels, axis=1) == i)[0]
    labeled_indices.extend(np.random.choice(indices, num_labeled_per_class, replace=False))

train_images_labeled_subset = train_images[labeled_indices]
train_labels_labeled_subset = train_labels[labeled_indices]
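
To confirm that per-class sampling of this kind really yields a balanced subset, the same loop can be run on toy labels and the class counts inspected. This is a sketch under stated assumptions: `toy_labels`, the three-class setup, and the fixed seed are hypothetical and only serve the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical toy labels: 3 classes with 50 examples each, shuffled.
toy_labels = rng.permutation(np.repeat(np.arange(3), 50))

per_class = 5
picked = []
for cls in range(3):
    candidates = np.where(toy_labels == cls)[0]
    # Sample without replacement so no index is chosen twice.
    picked.extend(rng.choice(candidates, per_class, replace=False))
picked = np.array(picked)

# Every class should contribute exactly `per_class` examples.
counts = np.bincount(toy_labels[picked], minlength=3)
print(counts)  # [5 5 5]
```

Seeding the generator is optional, but it makes the labeled subset reproducible across notebook runs.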

Then, we train a basic convolutional neural network (CNN) on this small labeled subset. Our CNN consists of three convolutional layers with ReLU activations, each followed by a max-pooling layer. The pooled output is flattened and passed through a dropout layer for regularization, a dense hidden layer, and a final dense layer that outputs the class scores. Since this is a multi-class classification model, we use softmax as the activation function of the final layer to predict class probabilities for each input image.

In [None]:
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dropout(0.25),
    layers.Dense(128, activation='relu'),
    layers.Dense(10, activation='softmax')
])
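
It can help to trace how the spatial size of a 32x32 image shrinks through this stack. The sketch below assumes the Keras defaults used above (unpadded 3x3 convolutions with stride 1, and 2x2 pooling with stride 2); `conv_out` and `pool_out` are hypothetical helper names.

```python
def conv_out(size, kernel=3):
    # Unpadded ('valid') convolution with stride 1 trims kernel - 1 pixels.
    return size - kernel + 1

def pool_out(size, pool=2):
    # 2x2 max pooling with stride 2 halves the size, flooring odd values.
    return size // pool

size = 32
trace = [size]
for _ in range(3):  # three Conv2D + MaxPooling2D pairs
    size = conv_out(size)
    trace.append(size)
    size = pool_out(size)
    trace.append(size)

flattened = size * size * 128  # 128 filters in the last convolution
print(trace)      # [32, 30, 15, 13, 6, 4, 2]
print(flattened)  # 512
```

Under these assumptions, the `Flatten` layer hands a 512-dimensional vector to the dropout and dense layers.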

We then specify the optimizer and the loss function, which are used during backpropagation to update the model parameters and train the model. Since this is a multi-class classifier with one-hot encoded labels, we use categorical cross-entropy as the loss.

In [None]:
model.compile(optimizer='adam', 
              loss='categorical_crossentropy', 
              metrics=['accuracy'])
model.fit(train_images_labeled_subset, 
          train_labels_labeled_subset, 
          batch_size=32, 
          epochs=20, 
          validation_split=0.1
         )
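
For intuition about the loss, categorical cross-entropy for a one-hot label reduces to the negative log of the probability assigned to the true class. The following NumPy sketch implements that formula directly; it is not the Keras implementation, and the function name and `eps` clipping value are illustrative assumptions.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip to avoid log(0); eps is an illustrative choice.
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

y_true = np.array([[0.0, 1.0, 0.0]])         # true class is index 1
confident = np.array([[0.05, 0.90, 0.05]])   # high probability on the true class
unsure = np.array([[0.40, 0.30, 0.30]])      # low probability on the true class

loss_confident = categorical_crossentropy(y_true, confident)[0]
loss_unsure = categorical_crossentropy(y_true, unsure)[0]
print(loss_confident, loss_unsure)
```

The confident prediction yields a loss of about 0.105 (that is, -ln 0.9), while the unsure one yields about 1.204 (-ln 0.3), so the optimizer is pushed toward assigning more probability to the true class.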

To understand the model performance trained on the small subset of labeled data, we get the accuracy on the test data as follows:

In [None]:
test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
print(f'Test accuracy: {test_acc}')

#### Generate Pseudo-labels
Next, we use the trained model to predict the labels of the unlabeled data. Here, the unlabeled data consists of all training images that were not selected for the labeled subset.

In [None]:
# np.setdiff1d returns the sorted indices that are not in labeled_indices
unlabeled_indices = np.setdiff1d(np.arange(train_images.shape[0]), labeled_indices)
train_images_unlabeled_subset = train_images[unlabeled_indices]

predictions = model.predict(train_images_unlabeled_subset)

#### Confidence Thresholding
We select the predictions with high confidence and use them as labels. In `pseudo_labels`, we store the data index along with the predicted label for only those data points whose highest class probability exceeds the confidence threshold.

In [None]:
confidence_threshold = 0.9
pseudo_labels = []

for i, prediction in enumerate(predictions):
    if max(prediction) > confidence_threshold:
        pseudo_labels.append((unlabeled_indices[i], np.argmax(prediction)))
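
The same filtering can also be expressed with a boolean mask instead of a Python loop, which tends to be faster on tens of thousands of predictions. This is a minimal sketch on made-up softmax outputs; `toy_predictions` and `toy_indices` are hypothetical.

```python
import numpy as np

# Hypothetical softmax outputs for four unlabeled samples, three classes.
toy_predictions = np.array([
    [0.97, 0.02, 0.01],
    [0.50, 0.30, 0.20],
    [0.05, 0.92, 0.03],
    [0.34, 0.33, 0.33],
])
toy_indices = np.array([10, 11, 12, 13])  # made-up positions in the full set
threshold = 0.9

confident = toy_predictions.max(axis=1) > threshold  # boolean mask per sample
kept_indices = toy_indices[confident]
kept_labels = toy_predictions[confident].argmax(axis=1)

print(list(zip(kept_indices.tolist(), kept_labels.tolist())))  # [(10, 0), (12, 1)]
```

Only the first and third samples pass the 0.9 threshold, so only they would receive pseudo-labels.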

#### Data Augmentation
Having filtered out the data points whose predictions have a low confidence score, we can now add the pseudo-labeled data to the training set.

In [None]:
pseudo_indices = [index for index, _ in pseudo_labels]
train_images_pseudo_labels = utils.to_categorical([label for _, label in pseudo_labels], 10)

train_images_combined = np.concatenate([train_images[labeled_indices], train_images[pseudo_indices]])
train_labels_combined = np.concatenate([train_labels_labeled_subset, train_images_pseudo_labels])

#### Model Retraining
Finally, we retrain the model on the combined original labeled data and pseudo-labeled data, with the aim of improving its accuracy.

In [None]:
model.fit(train_images_combined, 
          train_labels_combined, 
          batch_size=32, 
          epochs=20, 
          validation_split=0.1)

test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
print(f'Test accuracy: {test_acc}')

## 3. Self-training: Using pseudo-labels

### 3.2 Building a Self-Training Model

As discussed earlier, the initial steps of the self-training algorithm are the same as in pseudo-labeling. What differs is that the model is retrained iteratively from scratch. So let us use the same dataset to create a self-training model that is trained iteratively on labeled and unlabeled data. Below, we define a function that performs a single iteration of model training on labeled and pseudo-labeled data.

In [None]:
def self_train(model, train_labeled_images, train_labeled_labels, train_unlabeled_images, threshold):
    print(f"Labeled data size: {train_labeled_images.shape[0]}")
    
    model.fit(train_labeled_images, 
              train_labeled_labels, 
              batch_size=32, 
              epochs=20, 
              validation_split=0.1,
              verbose=0)

    predictions = model.predict(train_unlabeled_images)
    high_confidence_indices = np.max(predictions, axis=1) > threshold
    pseudo_labels = np.argmax(predictions[high_confidence_indices], axis=1)
    
    labeled_images_combined = np.concatenate((train_labeled_images, 
                                              train_unlabeled_images[high_confidence_indices]))
    labeled_labels_combined = np.concatenate((train_labeled_labels, 
                                              utils.to_categorical(pseudo_labels, 10)))

    # high_confidence_indices is a boolean mask, so its negation selects
    # the samples that did not receive a pseudo-label in this round.
    remaining_unlabeled_images = train_unlabeled_images[~high_confidence_indices]

    return labeled_images_combined, labeled_labels_combined, remaining_unlabeled_images

In the above code, we pass the CNN model, the originally labeled images as `train_labeled_images`, their labels as `train_labeled_labels`, and the unlabeled images as `train_unlabeled_images`. We also pass the confidence threshold, which decides which unlabeled data points are assigned pseudo-labels and included in the labeled dataset. The function returns the combined labeled images (including the pseudo-labeled ones), their labels, and the remaining unlabeled images.
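
Because the confidence test produces a boolean mask rather than integer positions, the unlabeled pool can be partitioned with the mask and its negation. The sketch below shows this on a toy pool; the array contents and the 0.95 cutoff are assumptions for illustration.

```python
import numpy as np

pool = np.array(["a", "b", "c", "d", "e"])  # stand-in for unlabeled images
confidence = np.array([0.99, 0.40, 0.97, 0.55, 0.96])

mask = confidence > 0.95   # True where a pseudo-label would be accepted
promoted = pool[mask]      # moves to the labeled set
remaining = pool[~mask]    # stays unlabeled for the next round

print(promoted)   # ['a' 'c' 'e']
print(remaining)  # ['b' 'd']
```

Every sample ends up in exactly one of the two arrays, so the unlabeled pool shrinks by the number of accepted pseudo-labels in each round.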

After the model is trained, we obtain the pseudo-labels that pass the threshold, add them to the expanded labeled set, and return three arrays: the labeled images, their labels, and the images that are still unlabeled. Next, let's iteratively retrain the model, reducing the confidence threshold in each iteration. We first define a function `get_model()` that returns a freshly initialized and compiled model, so that every iteration starts from reset weights.

In [None]:
def get_model():
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dropout(0.25),
        layers.Dense(128, activation='relu'),
        layers.Dense(10, activation='softmax')
    ])

    model.compile(optimizer='adam', 
              loss='categorical_crossentropy', 
              metrics=['accuracy'])
    
    return model

Then, we self-train the model iteratively, reducing the threshold by 5% of its current value in each iteration. We start by initializing the labeled and unlabeled datasets and setting the threshold to 95% and the number of iterations to 3. Over each iteration, we closely monitor the model's accuracy and convergence.
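
Since the threshold is multiplied by 0.95 after every iteration, the reduction is relative rather than a fixed five percentage points. A short sketch of the resulting schedule for three iterations (rounding is only for display):

```python
threshold = 0.95
schedule = []
for _ in range(3):
    schedule.append(round(threshold, 6))
    threshold *= 0.95  # shrink by 5% of the current value

print(schedule)  # [0.95, 0.9025, 0.857375]
```

The third iteration therefore accepts pseudo-labels with a confidence above roughly 0.857, not 0.85.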

In [None]:
threshold=0.95
iterations = 3

labeled_images_combined = train_images_labeled_subset
labeled_labels_combined = train_labels_labeled_subset
remaining_unlabeled_images = train_images_unlabeled_subset

for iteration in range(iterations):
    model = get_model()
    
    labeled_images_combined, labeled_labels_combined, remaining_unlabeled_images = self_train(model, 
                                           labeled_images_combined, 
                                           labeled_labels_combined, 
                                           remaining_unlabeled_images, 
                                           threshold)

    threshold *= 0.95

    loss, accuracy = model.evaluate(test_images, test_labels)
    print(f"Iteration {iteration+1}, Loss: {loss}, Accuracy: {accuracy}\n")

As the output shows, the performance of the model improves over the iterations, even though the initial set of labeled data remains limited.

## 4. Co-Training: A multi-model approach

### 4.2 Implementing a Co-Training Model

In this section, we will work through a hands-on exercise to train a co-training model. We will use the UCI Multiple Features dataset [https://archive.ics.uci.edu/dataset/72/multiple+features] for this exercise. The dataset is a collection of features of handwritten numerals 0-9, extracted from a collection of Dutch utility maps. It inherently provides six different feature sets (views). For our demonstration, let's use two of these views, specifically "Fourier coefficients of the character shapes" and "Karhunen-Loève coefficients", which are available on OpenML as `mfeat-fourier` and `mfeat-karhunen`.

We start by importing the necessary libraries for model training. `fetch_openml` is used to load the UCI Multiple Features dataset with its different views, `train_test_split` splits the dataset into training and test sets, and `StandardScaler` standardizes the features to zero mean and unit variance. We also use Gaussian Naive Bayes (`GaussianNB`) classifiers and `accuracy_score` to train and evaluate the co-training models.

In [None]:
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

Let us next load the 76 Fourier coefficients of the character shapes (`mfeat-fourier`) and the 64 Karhunen-Loève coefficients (`mfeat-karhunen`) as two views of the dataset. Assuming both datasets share the same order of instances, we load the target labels from only one view, since both views have the same target labels.

In [None]:
fourier = fetch_openml('mfeat-fourier', version=1)
karhunen = fetch_openml('mfeat-karhunen', version=1)

X_fourier, y = fourier.data, fourier.target
X_karhunen = karhunen.data

We then standardize the dataset to zero mean and unit variance using `sklearn`'s `StandardScaler`.

In [None]:
scaler_fourier = StandardScaler()
scaler_karhunen = StandardScaler()
X_fourier_scaled = scaler_fourier.fit_transform(X_fourier)
X_karhunen_scaled = scaler_karhunen.fit_transform(X_karhunen)
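
What standardization does to each column can be reproduced with plain NumPy. The following sketch uses a toy matrix and the population standard deviation (`ddof=0`), which is the estimator `StandardScaler` uses; the matrix values are made up.

```python
import numpy as np

# Toy feature matrix: 4 samples, 2 features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

# Per-column standardization: subtract the mean, divide by the std (ddof=0).
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # approximately [0. 0.]
print(X_scaled.std(axis=0))   # approximately [1. 1.]
```

After scaling, both columns contain the same values even though the raw features differed by a factor of 100, which keeps either feature from dominating a Gaussian model purely because of its units.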

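Now we split the dataset into training and test data as follows, retaining 30% of the data for testing. Both calls use the same `random_state`, so the two views are split on the same rows and stay aligned with the labels.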

In [None]:
X_train_fourier, X_test_fourier, y_train, y_test = train_test_split(X_fourier_scaled, y, test_size=0.3, random_state=42)
X_train_karhunen, X_test_karhunen = train_test_split(X_karhunen_scaled, test_size=0.3, random_state=42)

print(f"Shape of Fourier based train data {np.shape(X_train_fourier)}")
print(f"Shape of Fourier based test data {np.shape(X_test_fourier)}")
print(f"Shape of Karhunen based train data {np.shape(X_train_karhunen)}")
print(f"Shape of Karhunen based test data {np.shape(X_test_karhunen)}")
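
Co-training only works if row i of both views refers to the same sample after the split. The sketch below illustrates that requirement with a single shared permutation in NumPy; it is a stand-in for the idea, not the `train_test_split` implementation, and the toy views are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two toy "views" of the same 10 samples; row i describes sample i in both.
view_a = np.arange(10).reshape(-1, 1) * 1.0
view_b = np.arange(10).reshape(-1, 1) * 10.0

order = rng.permutation(10)  # one shuffle shared by both views
test_idx, train_idx = order[:3], order[3:]

a_train, a_test = view_a[train_idx], view_a[test_idx]
b_train, b_test = view_b[train_idx], view_b[test_idx]

# Rows stay paired: each view_b row is still 10x its view_a partner.
print(np.array_equal(b_train, a_train * 10))  # True
print(np.array_equal(b_test, a_test * 10))    # True
```

If the two views were shuffled independently, the pairing check would fail and pseudo-labels from one view would be attached to the wrong samples in the other.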

As the output shows, the first view, "Fourier coefficients of the character shapes", has 76 features, while the second view, "Karhunen-Loève coefficients", has 64 features. The 2,000 data points are divided into 1,400 training and 600 test data points.

Let's split the above data into four different sets. For each view, we take a small labeled subset and treat the larger remainder as unlabeled to simulate the co-training setting. For our exercise, we consider the first 300 data points of the training data as labeled and the remaining 1,100 data points as unlabeled. You can choose a different size for the labeled dataset.

In [None]:
NUM_LABELED = 300

X_labeled_fourier = X_train_fourier[:NUM_LABELED]
X_unlabeled_fourier = X_train_fourier[NUM_LABELED:]
X_labeled_karhunen = X_train_karhunen[:NUM_LABELED]
X_unlabeled_karhunen = X_train_karhunen[NUM_LABELED:]
y_labeled = y_train[:NUM_LABELED]

Now we'll create our co-training function, which handles two different feature sets so that each classifier trains on distinct information, and use it to train models on the two views created earlier. The `co_train` function takes the labeled and unlabeled data from the two views along with the initial target labels. It first trains both models on the initially labeled data. Then, in each iteration, it obtains pseudo-labels from the two classifiers, selects the high-confidence ones, and appends the high-confidence pseudo-labels from one classifier to the labeled dataset of the other classifier. Both models are then trained again on the new training data, which now includes the pseudo-labeled data. For the purpose of this demonstration, both classifiers are Gaussian Naive Bayes models.

We also define an `evaluate_models` function that evaluates the two classifiers on the test data. When the labels from both classifiers agree, we select the label from `classifier1`. If they disagree, we randomly select the label from one of the two classifiers, each with probability 0.5.

In [None]:
def evaluate_models(classifier1, classifier2, X_test_fourier, X_test_karhunen):
    predictions1 = classifier1.predict(X_test_fourier)
    predictions2 = classifier2.predict(X_test_karhunen)

    random_mask = np.random.rand(len(predictions1)) > 0.5 
    final_pred = np.where(predictions1 == predictions2, predictions1, np.where(random_mask, predictions1, predictions2))

    print("Accuracy of Co-Training model:", accuracy_score(y_test, final_pred))
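
The nested `np.where` used for combining the two predictions can be examined on a few hand-made labels. In this sketch, `preds1` and `preds2` are hypothetical predictions and the seed is fixed only to make the coin flips repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)

preds1 = np.array(["3", "7", "1", "4"])  # hypothetical view-1 predictions
preds2 = np.array(["3", "2", "1", "9"])  # hypothetical view-2 predictions

agree = preds1 == preds2
coin = rng.random(len(preds1)) > 0.5     # True -> take classifier 1's label

# Keep the shared label on agreement, otherwise follow the coin flip.
final = np.where(agree, preds1, np.where(coin, preds1, preds2))

print(agree)  # [ True False  True False]
print(final)
```

Positions 0 and 2 always keep the agreed label, while positions 1 and 3 take a label from one of the two classifiers depending on the random draw. This randomness is one reason the reported accuracy can vary between runs.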

In [None]:
def co_train(X_labeled_view1, X_labeled_view2, y_train, X_unlabeled_view1, X_unlabeled_view2, X_test_fourier, X_test_karhunen, n_iterations=10, threshold=0.9):
    classifier1 = GaussianNB()
    classifier2 = GaussianNB()
    
    classifier1.fit(X_labeled_view1, y_train)
    classifier2.fit(X_labeled_view2, y_train)

    print("For base classifiers")
    evaluate_models(classifier1, classifier2, X_test_fourier, X_test_karhunen)

    for iteration in range(n_iterations):
        pseudo_labels1 = classifier1.predict(X_unlabeled_view1)
        pseudo_labels2 = classifier2.predict(X_unlabeled_view2)
        
        confidences1 = classifier1.predict_proba(X_unlabeled_view1).max(axis=1) > threshold
        confidences2 = classifier2.predict_proba(X_unlabeled_view2).max(axis=1) > threshold

        new_X1 = X_unlabeled_view1[confidences2]
        new_y1 = pseudo_labels2[confidences2]
        new_X2 = X_unlabeled_view2[confidences1]
        new_y2 = pseudo_labels1[confidences1]

        pseudo_labeled_X1 = np.concatenate((X_labeled_view1, new_X1))
        y_train1 = np.concatenate((y_train, new_y1))
        pseudo_labeled_X2 = np.concatenate((X_labeled_view2, new_X2))
        y_train2 = np.concatenate((y_train, new_y2))

        classifier1.fit(pseudo_labeled_X1, y_train1)
        classifier2.fit(pseudo_labeled_X2, y_train2)

        print("Iteration: ", iteration+1)
        evaluate_models(classifier1, classifier2, X_test_fourier, X_test_karhunen)

    return classifier1, classifier2

classifier1, classifier2 = co_train(X_labeled_fourier, 
                                    X_labeled_karhunen, 
                                    y_labeled, 
                                    X_unlabeled_fourier, 
                                    X_unlabeled_karhunen, 
                                    X_test_fourier, 
                                    X_test_karhunen,
                                    threshold=0.95)


As we can see, the accuracy of the co-trained models was slightly better than that of the base classifiers. However, the accuracy keeps fluctuating between iterations. This could be because, in each iteration, we resolve label disagreements between the two classifiers using a randomizer.