<h1 style='text-aling:center;color:Navy'>  Big Data Science - Fall 2023  </h1>
<h1 style='text-aling:center;color:Navy'>  Assignment 1  </h1>

***

<b>Submission Deadline: This assignment is due Friday, November 3 at 8:59 P.M.</b>

A few notes before you start:
- You are not allowed to use built-in libraries for co-training and label propagation itself.
- Directly sharing answers is not okay, but discussing problems with other students is encouraged.
- You should start early so that you have time to get help if you're stuck.

- Complete all the exercises below and turn in a write-up in the form of a Jupyuter notebook, that is, an .ipynb file. The write-up should include your code and answers to exercise questions. You will submit your assignment online as an attachment (*.ipynb), through Canvas under Assignment 1.

# <span style="color:#3665af">Semi-Supervised Learning </span>
<hr>

###### Goal
In this assignment, we will explore the concepts and techniques of semi-supervised learning.

###### Prerequisites
This assignment has the following dependencies:
- Jupyter Notebook, along with the following libraries (which should be installed on the Computing Platform):
  - Scikit Learn
  - Numpy
  - os

Let's dive into the world of semi-supervised learning!

<div style="font-size:30px;color:#3665af;background-color:#E9E9F5;padding:10px;">Assignment Hands-on 

<div style="font-size:20px;color:#F1F8FC;background-color:#0095EA;padding:10px;"> Import libraries </div>

In [146]:
# importing libraries
#We have imported many libraries from both Scikit Learn and TensorFlow (Keras). 
#These libraries cover a range of functionalities including array operations, model training, evaluation metrics, and deep learning model components.
#numpy: Used to handle numerical operations and arrays.
#accuracy_score: Used to calculate the classification accuracy of a given model.
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import os
from sklearn.ensemble import RandomForestClassifier
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Dense,
                                     Flatten,
                                     Dropout,
                                     BatchNormalization,
                                     Conv2D,
                                     MaxPooling2D,)
from tensorflow.keras.regularizers import l2


<div style="font-size:20px;color:#F1F8FC;background-color:#0095EA;padding:10px;"> Learn more about the data </div>


In this assignment, we are working with data collected from two different view from same region via satellite technology to study the Arctic region. These two data types offer valuable insights into various sea ice types, thereby enhancing navigation in the Arctic.

1. **Sentinel-1 Data (view 1):** We use Synthetic Aperture Radar (SAR) satellite images from the Sentinel-1 mission. SAR images are incredibly useful for creating sea ice charts in the Arctic. SAR works by sending radar signals to the Earth's surface and capturing the signals that bounce back. One specific view we're utilizing is the Sentinel-1 image captured in HH polarization. This view helps us understand the characteristics of the sea ice in the Arctic. if you want to know more about it here is the [link](https://en.wikipedia.org/wiki/Sentinel-1).

2. **AMSR2 Data (view 2):** Alongside each Sentinel-1 image, we have corresponding data from the Advanced Microwave Scanning Radiometer 2 (AMSR2). This dataset contains information about the brightness temperatures of the Earth's surface. AMSR2 measures microwave radiation, which can be used to gather information about surface properties like sea ice concentration. if you want to know more about it here is the [link](https://www.ospo.noaa.gov/Products/atmosphere/gpds/about_amsr2.html).

To further analyze these datasets, we selected 10 files and divided the images into smaller patches, each patch 32 by 32 pixels. This patches allows us to focus on specific areas of interest within the Arctic and study them in detail. By combining the information from both Sentinel-1 and AMSR2 data, we can gain a comprehensive understanding of the Arctic environment and its sea ice patterns, which is crucial for various scientific and practical applications, including safe navigation in this challenging region.

view 1: Sentinel-1 image

<img alt="nersc_sar_primary view" src="nersc_sar_primary.jpg"/>

view 2: AMSR2 image

<img src="btemp_89_0h.jpg" alt = "btemp_89_0h view" >

Download the data.zip file from Canvas, and then execute the cell below to import the data. You can customize the directory name for the data if necessary.

In [147]:
#The function load_data_from_directories it takes in paths to three directories containing data from two views and their corresponding labels.
#It lists all files in each directory.
#Iterate over the files in one directory (in this case, the Sentinel-1 data directory).
#For each file, it checks if there are corresponding files in the other two directories (the AMSR2 data directory and the labels directory).
#If corresponding files are found, the data is loaded from these files and appended to respective
def load_data_from_directories(view1_dir, view2_dir, labels_dir):
    """
    Load data from directories containing two views and corresponding labels.

    Parameters:
    - view1_dir (str): Path to the directory containing view 1 data files.
    - view2_dir (str): Path to the directory containing view 2 data files.
    - labels_dir (str): Path to the directory containing label data files.

    Returns:
    - view1_data (numpy.ndarray): NumPy array containing data from view 1.
    - view2_data (numpy.ndarray): NumPy array containing data from view 2.
    - labels_data (numpy.ndarray): NumPy array containing label data.

    This function loads data from two views and their corresponding labels, assuming a common "number" part
    in the file names for matching files. It ensures that data files from both views and labels are consistent
    and loads them into NumPy arrays for further processing.
    """
    # List all files in each directory
    files_view1 = os.listdir(view1_dir)
    files_view2 = os.listdir(view2_dir)
    files_label = os.listdir(labels_dir)

    # Initialize empty lists to store data from each view and labels
    view1_data = []
    view2_data = []
    labels_data = []

    # Iterate through the files in the directory
    for filename in files_view1:
        if filename.endswith('_samples_view1.npy'):
            # Extract the common "number" part of the file name
            common_number = filename.split('_')[0]

            # Check if corresponding files exist for view2 and labels
            if common_number + '_samples_view2.npy' in files_view2 and common_number + '_labels.npy' in files_label:
                # Load data from the NumPy files
                data_view1 = np.load(os.path.join(view1_dir, filename))
                data_view2 = np.load(os.path.join(view2_dir, common_number + '_samples_view2.npy'))
                data_labels = np.load(os.path.join(labels_dir, common_number + '_labels.npy'))

                # Append data to respective lists
                view1_data.append(data_view1)
                view2_data.append(data_view2)
                labels_data.append(data_labels)

    view1_data = np.array(view1_data)
    view2_data = np.array(view2_data)
    labels_data = np.array(labels_data)

    return view1_data, view2_data, labels_data

#These are the location paths of the data sets 
view1_dir = '/Users/AL_Loan/Documents/Big data Assignment/data/view1' #this will be changing according to the location  change the directory as needed
view2_dir = '/Users/AL_Loan/Documents/Big data Assignment/data/view2' #this will be changing according to the location  change the directory as needed
labels_dir = '/Users/AL_Loan/Documents/Big data Assignment/data/labels' #this will be changing according to the location  change the directory as needed

#This ststement calls the function load_data_from_directories
view1_data, view2_data, labels_data = load_data_from_directories(view1_dir, view2_dir, labels_dir)

#These statements below prints the shape of view1, view2 and the labels in the data set
print(" shape view 1 data: ", view1_data.shape)
print(" shape view 2 data: ", view2_data.shape)
print(" shape labels data: ", labels_data.shape)


 shape view 1 data:  (13683, 32, 32)
 shape view 2 data:  (13683, 32, 32)
 shape labels data:  (13683, 1)


<div style="font-size:20px;color:#F1F8FC;background-color:#0095EA;padding:10px;">Part  1. Co-Training Models for Sea Ice Classification</div>

In this task, you'll be applying the Co-training technique to the dataset. Through this, you'll observe the outcomes of both semi-supervised learning and supervised learning when there's only a limited amount of labeled data available. You may want to revisit Lecture 02, which covers the topic of cotraining for a more comprehensive understanding

1-1. Divide the dataset into three distinct sets: one for labeled data, one for unlabeled data, and one for test data. Make sure that the labeled dataset contains between 100 and 130 data.

In [151]:
#The function split_dataset is designed to take two views of a dataset and split them into labeled, unlabeled, and test sets. 
#The parameters allow for the specification of the size of the labeled data, the proportion of the data to be used as the test set, and a random seed for reproducibility. 
#The function then returns these subsets along with their corresponding labels.
def split_dataset(dataset_view1, dataset_view2, labeled_size=120, test_size=0.2, random_seed=42):
    """
    Split the dataset into labeled, unlabeled, and test sets.

    Parameters:
    - dataset (list or array-like): The input dataset to be split.
    - labeled_size (int): The target size for the labeled set (default: 130).
    - test_size (float): The proportion of the dataset to include in the test split (default: 0.2).
    - random_seed (int): Seed for reproducibility (default: None).

    Returns:
    - labeled_set_view1: Subset of the dataset with labeled data (approximately 100-130 points).
    - labeled_set_view2: Subset of the dataset with labeled data (approximately 100-130 points).
    - label_labeled_set: Labels corresponding to the labeled data points.
    - unlabeled_set: Subset of the dataset with unlabeled data.
    - test_set: Subset of the dataset for testing.
    - label_test_set: Labels corresponding to the test data points.
    """
    # First, let's create a common index for shuffling and splitting the data
    #The function starts by creating an array of indices ranging from 0 to the length of dataset_view1. 
    #both dataset_view1 and dataset_view2 have the same number of samples.
    n_sam = len(dataset_view1)
    indcs = np.arange(n_sam)

    # Shuffle the indices for random sampling
    #It uses the provided random seed to ensure reproducibility and shuffles these indices to randomize the selection of samples for the labeled, unlabeled, and test sets.
    np.random.seed(random_seed)
    np.random.shuffle(indcs)

    # Determine the number of labeled data points
    #The function ensures that the size of the labeled set (l_size) is between 100 and 130, inclusive, regardless of the given labeled_size.
    l_size = max(100, min(labeled_size, 130))

    # Calculate the number of labeled and test data points
    #It calculates the exact number of samples that will be in the labeled set (n_label) and the test set (n_test), 
    #the latter based on the test_size proportion of the total number of samples.
    n_label = int(l_size)
    n_test = int(test_size * n_sam)

    # Split the data into subsets of labeled, unlabeled, and test sets 
    l_indcs = indcs[:n_label]
    t_indcs = indcs[n_label:n_label + n_test]
    ul_indcs = indcs[n_label + n_test:]

    #Using the shuffled indices, the function creates:
    #labeled_set_view1 and labeled_set_view2: These are the first n_label samples from both views for labeled data.
    labeled_set_view1 = dataset_view1[l_indcs]
    labeled_set_view2 = dataset_view2[l_indcs]
    
    #unlabeled_set_view1 and unlabeled_set_view2: These contain the remaining samples after removing the labeled and test samples, for unlabeled data.
    unlabeled_set_view1 = dataset_view1[ul_indcs]
    unlabeled_set_view2 = dataset_view2[ul_indcs]

    #test_set_view1 and test_set_view2: These are the next n_test samples from both views for test data.
    test_set_view1 = dataset_view1[t_indcs]
    test_set_view2 = dataset_view2[t_indcs]
    
    #label_labeled_set and label_test_set: These are the labels corresponding to the labeled and test data points, respectively.
    label_test_set = labels_data[t_indcs]
    label_labeled_set = labels_data[l_indcs] 
    
    #It returns the subsets and their corresponding labels.
    return labeled_set_view1, labeled_set_view2, label_labeled_set, unlabeled_set_view1, unlabeled_set_view2, test_set_view1, test_set_view2, label_test_set    

#This statement calls the finction split_dataset and assigns in into variables
lab_s_v1, lab_s_v2, lab_labl_set, unl_s_v1, unl_s_v2, t_s_v1, t_s_v2, lab_t_set = split_dataset(view1_data, view2_data, labeled_size=120, test_size=0.2, random_seed=42)

#Receive the outputs and print shapes: 
#This is useful for verification, to ensure that the splits have the expected sizes.
#print(lab_s_v2.shape, lab_s_v1.shape, lab_labl_set.shape, unl_s_v1.shape, unl_s_v2.shape, t_s_v1.shape, t_s_v2.shape, lab_t_set.shape)
print("The Shape of labelled view 1:", lab_s_v1.shape)
print("The Shape of labelled view 2:", lab_s_v2.shape)
print("The Shape of Labeled Set shape:", lab_labl_set.shape)
print("The Shape of the labelled test set:", lab_t_set.shape)
print("The Shape of unlabeled_set_view1:", unl_s_v1.shape, "\n" "The Shape of unlabeled_set_view2:", unl_s_v2.shape, "\n" "The Shape of test_set_view1:", t_s_v1.shape, "\n" "The Shape of test_set_view2:", t_s_v2.shape,)

The Shape of labelled view 1: (120, 32, 32)
The Shape of labelled view 2: (120, 32, 32)
The Shape of Labeled Set shape: (120, 1)
The Shape of the labelled test set: (2736, 1)
The Shape of unlabeled_set_view1: (10827, 32, 32) 
The Shape of unlabeled_set_view2: (10827, 32, 32) 
The Shape of test_set_view1: (2736, 32, 32) 
The Shape of test_set_view2: (2736, 32, 32)


1-2. initialize two classifiers for each view using scikit-learn. Consider using a Convolutional Neural Network (CNN) as one of the classifiers and a Random Forest as the other.
Here's a short description of the configuration for the CNN (Convolutional Neural Network) and Random Forest (RF) classifiers to implement:

**CNN Classifier Configuration:**

1. Input Layer: BatchNormalization with input shape (32, 32, 1).
2. Convolutional Layer 1: 32 filters, each with a 3x3 kernel and ReLU activation.
3. Max Pooling Layer 1: 2x2 pooling with a stride of 2.
4. Convolutional Layer 2: 32 filters, each with a 3x3 kernel and ReLU activation.
5. Max Pooling Layer 2: 2x2 pooling with a stride of 2.
6. Convolutional Layer 3: 32 filters, each with a 3x3 kernel and ReLU activation.
7. Max Pooling Layer 3: 2x2 pooling with a stride of 2.
8. BatchNormalization Layer.
9. Flatten Layer.
10. Dropout Layer with a dropout rate of 0.1.
11. Fully Connected Layer 1: 16 neurons, ReLU activation, and L2 regularization with a weight decay of 0.001.
12. Dropout Layer with a dropout rate of 0.1.
13. Fully Connected Layer 2: 16 neurons, ReLU activation, and L2 regularization with a weight decay of 0.001.
14. Dropout Layer with a dropout rate of 0.1.
15. Output Layer: Dense layer with the number of neurons equal to the number of classes and softmax activation.

**CNN Model Compilation:**
- Optimizer: Adam
- Loss Function: Sparse Categorical Crossentropy
- Metrics for Evaluation: Accuracy

**Random Forest Classifier Configuration:**
- Number of Estimators: 20
- Random State: 42




In [152]:
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.layers import Input
num_classes = 6
#This code implements classifiers based on the provided definition
#cnn_classifier =
#rf_classifier =
#This is the python code for the Convolution Neural Network (CNN) classifier configuration as per the defination that the professor has given in the explaination above at 1-2
#This code models a CNN classifier
cnn_classifier = Sequential([
    Input(shape=(32, 32, 1)),
    BatchNormalization(),
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2), strides=2),
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2), strides=2),
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2), strides=2),
    BatchNormalization(),
    Flatten(),
    Dropout(0.1),
    Dense(16, activation='relu', kernel_regularizer=l2(0.001)),
    Dropout(0.1),
    Dense(16, activation='relu', kernel_regularizer=l2(0.001)),
    Dropout(0.1),
    Dense(num_classes, activation='softmax')
])

cnn_classifier.compile(optimizer='adam',loss='sparse_categorical_crossentropy',metrics=['accuracy'])


#This is the python code for the Random Forest classifier configuration as per the defination that the professor has given in the explaination above at 1-2
#This code models a Random Forest classifier.
rf_classifier = RandomForestClassifier(n_estimators=20, random_state=42)

1-3. Do the co-training part:
   - Train classifiers on the labeled data
   - Predict on the unlabeled data and identify instances that have a confidence score more than 90.
   - Add the confident instances to the labeled set and train again
   - Compute the accuracy of the classifiers on test set
To provide a more understanding of the accuracy measure, please refer to the following link: [link](https://en.wikipedia.org/wiki/Accuracy_and_precision).
<img src="accuracy.jpg" alt = "accuracy metric" >

In [153]:
#Here is a python code that outputs the summary of the CNN classifier. 
#It outputs the type of model along with the different layers in the output, the shape of the layers along with the parameters in each output layer
cnn_classifier.summary()
#This will be helpful in determining and to verify that the CNN classifier is modelled correctly

Model: "sequential_14"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 batch_normalization_28 (Ba  (None, 32, 32, 1)         4         
 tchNormalization)                                               
                                                                 
 conv2d_30 (Conv2D)          (None, 30, 30, 32)        320       
                                                                 
 max_pooling2d_30 (MaxPooli  (None, 15, 15, 32)        0         
 ng2D)                                                           
                                                                 
 conv2d_31 (Conv2D)          (None, 13, 13, 32)        9248      
                                                                 
 max_pooling2d_31 (MaxPooli  (None, 6, 6, 32)          0         
 ng2D)                                                           
                                                     

In [112]:
#The co_training function implements a co-training algorithm, which is a semi-supervised learning technique
#This improves the performance of two classifiers by using a small amount of labeled data and a larger pool of unlabeled data. 
#The function takes as inputs two classifiers, labeled data, unlabeled data, test data, and a confidence threshold. 
#It returns the accuracy of both classifiers on the test set after the co-training process.
def co_training(classifier1, classifier2, labeled_set, unlabeled_set, test_set , threshold_confidence):
    """
    Perform co-training with two classifiers on labeled and unlabeled data.

    Parameters:
    - classifier1: The first classifier (e.g., CNN).
    - classifier2: The second classifier (e.g., Random Forest).
    - labeled_set (list or array-like): Labeled dataset.
    - unlabeled_set (list or array-like): Unlabeled dataset.
    - test_set (list or array-like): Test dataset.
    - threshold_confidence (float): The minimum confidence threshold for adding unlabeled samples to the training set.

    Returns:
    - classifier1_accuracy (float): Accuracy of Classifier 1 on the test set after co-training.
    - classifier2_accuracy (float): Accuracy of Classifier 2 on the test set after co-training.
    """
    #labeled_set and test_set are tuples containing the data and labels.
    lab_s_v1, lab_labl_set = labeled_set
    t_s_1, lab_t_set = test_set

    #It reshapes the labeled and unlabeled sets as needed for the classifiers. 
    #For example, the labeled set lab_s_v1 might be reshaped to a 2D array if required by classifier2. 
    #Similarly, it flattens the labels.
    lab_s_v1_2d = lab_s_v1.reshape(lab_s_v1.shape[0], -1)
    unl_s_2d = unlabeled_set.reshape(unlabeled_set.shape[0], -1)
    lab_labl_set = lab_labl_set.ravel()
    unl_set = unl_s_2d
    
    #The co-training process involves fitting both classifiers on the current labeled data 
    #Then using these classifiers to predict labels for the unlabeled data. 
    #If both classifiers are sufficiently confident about their predictions these unlabeled instances are moved to the labeled set.
    while len(unl_set) > 0:
        
        # Fit classifier1 on the labeled data
        print(len(unl_set), "len(unl_set)")
        
        #classifier1 and classifier2 are trained on the current labeled data. 
        #classifier1 seems to be expecting data shaped in a way suitable for CNNs (hence the 4D reshaping)
        classifier1.fit(lab_s_v1, lab_labl_set, epochs=10, verbose=2)

        # Fit classifier2 on the 2D labeled data
        # #while classifier2 is trained on 2D data, perhaps suitable for a classifier like a Random Forest or SVM.
        classifier2.fit(lab_s_v1_2d, lab_labl_set)

        # Predict on the unlabeled set using both classifiers
        #Both classifiers make predictions on the unlabeled data. 
        #classifier1 predict probabilities (since it uses max to get the highest confidence score).
        unl_set = unl_set.reshape(unl_set.shape[0], 32, 32, 1)
        predct1 = classifier1.predict(unl_set)
        #while classifier2 seems to output direct predictions.
        unl_set = unl_set.reshape(unl_set.shape[0], 1024)
        predct2 = classifier2.predict(unl_set)
        
        #both classifiers are above the confidence threshold. T
        #hese instances are considered to be confidently labeled by both classifiers.
        confdt_indcs = [] 
        for i in range(len(unl_set)):
            confdce1 = max(predct1[i])  # Get the maximum probability of the predicted classes
            #confidence2 = max(classifier2.predict_proba([unl_s_2d[i]])[0])
            confdce2 = predct2[i]
              
            #Confident instances are added to the labeled set, and their respective data points are removed from the unlabeled set.
            if float(confdce1) > threshold_confidence and float(confdce2) > threshold_confidence:
                confdt_indcs.append(i)
        
        #If no confident instances are found, the loop breaks, and no more unlabeled data is added to the labeled set.
        if len(confdt_indcs) == 0:
            break

            
        #both classifiers are evaluated on the test set. 
        # Move confident instances from unlabeled to labeled set
        confdt_da = [unl_set[i] for i in confdt_indcs]
        confdt_la = [predct1[i] for i in confdt_indcs]

        lab_s_v1 = np.concatenate((lab_s_v1, confdt_da), axis=0)
        lab_labl_set = np.concatenate((lab_labl_set, confdt_la), axis=0)

        # Remove confident instances from the unlabeled set
        unl_set = np.delete(unl_set, confdt_indcs, axis=0)
        unl_set = unl_set.reshape(unl_set.shape[0], 32, 32, 1)

    #The accuracy of classifier1 is obtained directly from its evaluate method, 
    # After co-training, evaluate the classifiers on the test set
    classf1_sc = classifier1.evaluate(t_s_1, lab_t_set, verbose=0)
    classf1_ac = classf1_sc[1]

    t_s_2d = t_s_1.reshape(t_s_1.shape[0], -1)
    classf2_ac = classifier2.score(t_s_2d, lab_t_set)

    #returns the accuracy of both classifiers on the test set.
    return classf1_ac, classf2_ac

#After the loop ends, both classifiers are evaluated on the test set. 
#The accuracy of classifier1 is obtained directly from its evaluate method, which seems to return a score object from which the accuracy is extracted. 
#For classifier2, the accuracy is obtained by using the score method



1-4. pick one of the classifiers and do the supervised training with the labeled data and calculate the accuracy

In [155]:
#This code defines the supervised learning model of either the CNN or the RF Classifier based on the classifier passed.
#The fuction returns the accuracy of the supervised model.
#The supervised_training_and_accuracy function is defined to perform supervised training using a given classifier on labeled training data and then assess the trained model's accuracy on test data. 
#The function takes five parameters: classifier, labeled_data, labeled_labels, test_data, and test_labels. 
#It returns a single float value representing the accuracy of the classifier on the test data.
def supervised_training_and_accuracy(classifier, labeled_data, labeled_labels, test_data, test_labels):
    """
    Perform supervised training with a classifier on the labeled data and calculate the accuracy on test data.

    Parameters:
    - classifier: The classifier to be used for supervised training (e.g., Random Forest).
    - labeled_data (array-like): Labeled training data.
    - labeled_labels (array-like): Labels for the labeled training data.
    - test_data (array-like): Test data for evaluation.
    - test_labels (array-like): Labels for the test data.

    Returns:
    - accuracy (float): Accuracy of the classifier on the test data after supervised training.
    """
    #The classifier and the datasets are assigned to local variables classf, lab_d, t_d, lab_labls, and t_labls, respectively.
    classf = classifier
    lab_d = labeled_data
    t_d = test_data
    lab_labls = labeled_labels
    t_labls = test_labels
    
    
    #The function checks if the labeled_data and test_data are three-dimensional. 
    #If they are, the function reshapes these arrays into two-dimensional arrays. 
    #The need for reshaping suggests that the original data might be in a format suitable for image data (width x height x channels), 
    #and the classifier expects two-dimensional data (features x samples).
    if len(lab_d.shape) == 3:
        lab_d = lab_d.reshape(lab_d.shape[0], -1)
    if len(t_d.shape) == 3:
        t_d = t_d.reshape(t_d.shape[0], -1)
    
    #The classifier is trained using the fit method on the labeled_data and labeled_labels.
    classf.fit(lab_d, lab_labls)
    
    #The trained classifier is used to predict the labels for the test_data using the predict method.
    predct = classf.predict(t_d)
    
    #The accuracy of the predictions is calculated by comparing the predicted labels predct with the actual labels of the test set t_labls using the accuracy_score function. 
    #This function is likely imported from a library such as scikit-learn.
    ac = accuracy_score(t_labls, predct)
    
    #returns the calculated accuracy
    return ac

#this function encapsulates the typical supervised learning workflow for classification tasks. 
#It assumes that the classifier provided has a fit method for training and a predict method for inference, which is standard for classifiers in libraries like scikit-learn. 
#The function is generic enough to be used with various classifiers as long as they adhere to this interface.

1-5. Compare the Co-training approach accuracy and supervised model with limited labeled data and write your reason about it.

In [114]:
# your code and answer here
# This code calls the function co-training that id defined in the part 1-3 to calculate the accuracy of CNN classifier and the RF Classifier

#A variable threshold_confidence is set to 0.90. 
#This threshold is used within the co_training function to determine whether the predictions made by the classifiers on the unlabeled data are confident enough to be added to the labeled training set.
threshold_confidence = 0.90

#lab_set and t_set are tuples that contain labeled training data (along with corresponding labels) and test data (along with corresponding test labels), respectively. 
#The first element of each tuple contains the data, and the second contains the labels.
lab_set = (lab_s_v1, lab_labl_set)
t_set = (t_s_v1, lab_t_set)

#The co_training function is called with the following parameters:

#cnn_classifier: An instance of a Convolutional Neural Network classifier.
#rf_classifier: An instance of a Random Forest classifier.
#lab_set: The labeled dataset and labels.
#unl_s_v1: The unlabeled dataset (unlabeled training data).
#t_set: The test dataset and test labels.
#threshold_confidence: The confidence threshold for adding unlabeled samples to the training set.

cnn_ac, rf_ac = co_training(cnn_classifier, rf_classifier, lab_set, unl_s_v1, t_set, threshold_confidence)

#The above function call returns two values:
#1.cnn_ac: The accuracy of the CNN classifier after co-training.
#2.rf_ac: The accuracy of the Random Forest classifier after co-training.

#This line prints the accuracy of two classifier models defined
print(f"The Accuracy of CNN classifier: {cnn_ac} \nThe Accuracy of RF classifier:: {rf_ac}")

10827 len(unl_set)
Epoch 1/10
4/4 - 2s - loss: 1.6511 - accuracy: 0.4417 - 2s/epoch - 589ms/step
Epoch 2/10
4/4 - 0s - loss: 1.2918 - accuracy: 0.6167 - 84ms/epoch - 21ms/step
Epoch 3/10
4/4 - 0s - loss: 1.1045 - accuracy: 0.6333 - 75ms/epoch - 19ms/step
Epoch 4/10
4/4 - 0s - loss: 1.0244 - accuracy: 0.6667 - 79ms/epoch - 20ms/step
Epoch 5/10
4/4 - 0s - loss: 0.9057 - accuracy: 0.7500 - 83ms/epoch - 21ms/step
Epoch 6/10
4/4 - 0s - loss: 0.8157 - accuracy: 0.7750 - 75ms/epoch - 19ms/step
Epoch 7/10
4/4 - 0s - loss: 0.8642 - accuracy: 0.7167 - 71ms/epoch - 18ms/step
Epoch 8/10
4/4 - 0s - loss: 0.7039 - accuracy: 0.8083 - 91ms/epoch - 23ms/step
Epoch 9/10
4/4 - 0s - loss: 0.7508 - accuracy: 0.7333 - 73ms/epoch - 18ms/step
Epoch 10/10
4/4 - 0s - loss: 0.6511 - accuracy: 0.8000 - 85ms/epoch - 21ms/step
The Accuracy of CNN classifier: 0.5146198868751526 
The Accuracy of RF classifier:: 0.8636695906432749


In [120]:
# Supervised training with one of the classifiers (e.g., Random Forest)
#This is just a reference code mentioned here to print the accuracy of the supervised classifier with random forest classifier
rf_super_acc = supervised_training_and_accuracy(rf_classifier, lab_s_v1, lab_labl_set, t_s_v2, lab_t_set)
print(f"The Accuracy of supervised RF classifier is :{rf_super_acc}")

The Accuracy of supervised RF classifier is :0.5573830409356725


  return fit_method(estimator, *args, **kwargs)


In [116]:
#To Compare the accuracies of the different models
#simple if else statement to verify which among CNN co-training accuracy or RF supervised accuracy id higher
if cnn_ac > rf_super_acc:
    print("Co-training CNN accuracy is is higher than Supervised RF accuracy")
else:
    print("Co-training CNN accuracy is is lower than Supervised RF accuracy")
    
#simple if else statement to verify which among RF co-training accuracy or RF supervised accuracy id higher
if rf_ac > rf_super_acc:
    print("Co-training RF accuracy is is higher than Supervised RF accuracy")
else:
    print("Co-training RF accuracy is is lower than Supervised RF accuracy")

Co-training CNN accuracy is is lower than Supervised RF accuracy
Co-training RF accuracy is is higher than Supervised RF accuracy


<div style="font-size:20px;color:#F1F8FC;background-color:#0095EA;padding:10px;">Part 2. Label Propagation for Sea Ice Classification</div>


In this task, you'll be applying the label propagation technique to the dataset. Through this, you'll observe the outcomes of both semi-supervised learning and supervised learning when there's only a limited amount of labeled data available.


<img src="label_propagation.jpg" alt = "label propgation process" >

 2-1. Apply the K-Nearest Neighbors (KNN) algorithm with a parameter configuration where n_neighbors is set to 7 for the label propagation model. Utilize one of the labeled data views and the corresponding unlabeled data from part 1 as input.
To provide a more understanding of the accuracy measure, please refer to the following link: [link](https://en.wikipedia.org/wiki/Accuracy_and_precision).

In [121]:
# Please write your code here. Comments are provided for guidance purposes. make adjustments as needed.
#K-Nearest Neighbors (KNN) algorithm to perform semi-supervised learning, specifically label propagation.
from sklearn.neighbors import KNeighborsClassifier
def label_propagation(labeled_data, unlabeled_data, labeled_labels, test_data, label_test, n_neighbors=7):
    """
    Apply K-Nearest Neighbors (KNN) to the label propagation model on one data view and test data.

    Parameters:
    - labeled_data (array-like): Labeled data points.
    - unlabeled_data (array-like): Unlabeled data points.
    - labeled_labels (array-like): Labels corresponding to the labeled data points.
    - test_data (array-like): Test data to evaluate label propagation performance.
    - label_test (array-like): Labels corresponding to the test data points.
    - n_neighbors (int): Number of neighbors to consider in KNN (default: 7).

    Returns:
    - accuracy (float): Accuracy of label propagation on the test data.
    """
    #initialize the parameters to local variables
    lab_d = labeled_data
    unl_d = unlabeled_data
    t_d = test_data
    lab_labls = labeled_labels
    lab_t = label_test
    
    #The function checks the shape of the input data arrays (labeled_data, unlabeled_data, and test_data). 
    #If any of these arrays have three dimensions, they are reshaped into two dimensions. 
    #This reshaping is likely because the KNN algorithm expects input data in a two-dimensional format (number of samples x number of features).
    if len(lab_d.shape) == 3:
        lab_d = lab_d.reshape(lab_d.shape[0], -1)
    if len(unl_d.shape) == 3:
        unl_d = unl_d.reshape(unl_d.shape[0], -1)
    if len(t_d.shape) == 3:
        t_d = t_d.reshape(t_d.shape[0], -1)
        
    
    #An instance of KNeighborsClassifier is created with n_neighbors set to the value passed to the function. 
    knn_classf = KNeighborsClassifier(n_neighbors=n_neighbors)
    
    #The classifier is then trained (fit) on the labeled_data with its corresponding labeled_labels.
    knn_classf.fit(lab_d, lab_labls)

    #The classifier is used to predict labels for the unlabeled_data.
    predt_lab = knn_classf.predict(unl_d)
    
    #The predicted labels for the unlabeled_data are concatenated with the labeled_labels. 
    #Then, predt_lab is resized to match the shape of lab_t (which represents the labels of the test data).
    if lab_labls.shape != predt_lab.shape:
        lab_labls = lab_labls.reshape(-1)    
    predt_lab = np.concatenate((predt_lab, lab_labls))
    predt_lab = np.resize(predt_lab, lab_t.shape)
    
    # Calculate accuracy
    accuracy = accuracy_score(lab_t, predt_lab)
    
    #return the value of accuracy
    return accuracy
    
n_neighbors = 7

#The function label_propagation is called with the appropriate parameters from part 1 of this assignment
#The accuracy is stored in the variable ac. 
ac = label_propagation(lab_s_v1, unl_s_v1, lab_labl_set, t_s_v1, lab_t_set, n_neighbors)

#The accuracy is printed to the console.
print("Accuracy of the semi-supervised label propagation:", ac)

  return self._fit(X, y)


Accuracy of the semi-supervised label propagation: 0.4791666666666667


2-2. Select a classification algorithm and perform supervised learning on the labeled set. Then, evaluate the model's performance by calculating the accuracy. You can use a built-in library for the classifier. Compare your sepervised and semi supervised accuracy.

In [124]:
from sklearn.svm import SVC
# Please write your code here. Comments are provided for guidance purposes. make adjustments as needed.
def supervised_training_and_accuracy(classifier, labeled_data, labeled_labels, test_data, test_labels):
    """
    Perform supervised training with a classifier on the labeled data and calculate the accuracy on test data.

    Parameters:
    - classifier: The classifier to be used for supervised training (e.g., Random Forest).
    - labeled_data (array-like): Labeled training data.
    - labeled_labels (array-like): Labels for the labeled training data.
    - test_data (array-like): Test data for evaluation.
    - test_labels (array-like): Labels for the test data.

    Returns:
    - accuracy (float): Accuracy of the classifier on the test data after supervised training.
    """
    #initialization to local variables
    lab_d = labeled_data
    t_d = test_data
    
    #If the labeled_data or test_data has three dimensions, the data is reshaped into two dimensions. 
    #This is because the SVM classifier expects data to be in a format where each row is a sample and each column is a feature.
    if len(lab_d.shape) == 3:
        lab_d = lab_d.reshape(lab_d.shape[0], -1)
    if len(t_d.shape) == 3:
        t_d = t_d.reshape(t_d.shape[0], -1)
        
    #A new SVC object is created regardless of the classifier passed in as a parameter.
    classifier = SVC()
    
    lab_labs = labeled_labels
    #The classifier is trained (fit) on the labeled_data and labeled_labels.
    classifier.fit(lab_d, lab_labs)
    
    t_labs = test_labels
    #The trained classifier is then used to predict the labels for the test_data. 
    predt = classifier.predict(t_d)
    
    #The above predictions are compared with the test_labels to calculate the accuracy using the accuracy_score function.
    # Calculate the accuracy of the classifier
    ac = accuracy_score(t_labs, predt)

    return ac


#This code calls the supervised training and accuracy function to evaluate the model's performance by calculating the accuracy.
ac_svm = supervised_training_and_accuracy(SVC(), lab_s_v1, lab_labl_set, t_s_v1, lab_t_set)

#The accuracy of SVM is found here
print("The accuracy of SVM:", ac_svm)


The accuracy of SVM: 0.8271198830409356


  y = column_or_1d(y, warn=True)


In [126]:
#Here, just comparing the accuracy of the accuracy of semi-supervised label propagation and the SVM model
#It is clear that the semi-supervised label propagation model has less accuracy than SVM
print("Accuracy of the semi-supervised label propagation is", ac, "which is less than that of active SVM ",ac_svm)

Accuracy of the semi-supervised label propagation is 0.4791666666666667 which is less than that of active SVM  0.8271198830409356


<div style="font-size:20px;color:#F1F8FC;background-color:#0095EA;padding:10px;">Part 3. Now let's perform some experimentation and make some observations!</div>



3-1. We will explore the impact of varying the threshold confidence in the co-training process at three different values: 80, 70, and 60. We will then assess the accuracy of co-training based on these threshold settings.
To provide a more understanding of the accuracy measure, please refer to the following link: [link](https://en.wikipedia.org/wiki/Accuracy_and_precision).

In [127]:
# your code here
#The will be evaluating the performance of a co-training algorithm at three different confidence threshold levels (80%, 70%, and 60%). 
#The co-training algorithm is a semi-supervised learning technique that uses two classifiers to teach each other from unlabeled data when they are confident enough in their predictions.
#This co-training algorithm is defined in part 1 of the assignment

#l_set represents the labeled dataset, which includes both the feature set lab_s_v1 and the corresponding labels lab_labl_set. 
#t_set represents the test dataset, composed of the feature set t_s_v1 and the test labels lab_t_set.
l_set = (lab_s_v1, lab_labl_set)
t_set = (t_s_v1, lab_t_set)

#The function co_training is called three times with different confidence threshold values: 
#The different confidence threshold values are 0.80, 0.70, and 0.60, which correspond to 80%, 70%, and 60% confidence thresholds respectively. 
#This is mentioned in the question

#function call with threshold values 0.8 or 80% and calculate accuracy
thresh_conf1 = 0.80
cnn_ac1, rf_ac1 = co_training(cnn_classifier, rf_classifier, l_set, unl_s_v1, t_set, thresh_conf1)

#function call with threshold values 0.7 or 70% and calculate accuracy
thresh_conf2 = 0.70
t_set = (t_s_v1, lab_t_set)
cnn_ac2, rf_ac2 = co_training(cnn_classifier, rf_classifier, l_set, unl_s_v1, t_set, thresh_conf2)

#function call with threshold values 0.6 or 60% and calculate accuracy
thresh_conf3 = 0.60
t_set = (t_s_v1, lab_t_set)
cnn_ac3, rf_ac3 = co_training(cnn_classifier, rf_classifier, l_set, unl_s_v1, t_set, thresh_conf2)


#The accuracies for each pair of classifiers at the different thresholds are printed out to give an overview
#how the confidence threshold affects the performance of the co-training process.
print("Accuracy of the co training of unlabelled set at threshold confidence of 80 for CNN is ",cnn_ac1," and for RF is ",rf_ac1)
print("Accuracy of the co training of unlabelled set at threshold confidence of 70 for CNN is ",cnn_ac2," and for RF is ",rf_ac2)
print("Accuracy of the co training of unlabelled set at threshold confidence of 60 for CNN is ",cnn_ac3," and for RF is ",rf_ac3)

10827 len(unl_set)
Epoch 1/10
4/4 - 0s - loss: 0.6417 - accuracy: 0.8000 - 137ms/epoch - 34ms/step
Epoch 2/10
4/4 - 0s - loss: 0.6554 - accuracy: 0.8167 - 134ms/epoch - 34ms/step
Epoch 3/10
4/4 - 0s - loss: 0.6494 - accuracy: 0.8000 - 127ms/epoch - 32ms/step
Epoch 4/10
4/4 - 0s - loss: 0.6383 - accuracy: 0.8083 - 130ms/epoch - 32ms/step
Epoch 5/10
4/4 - 0s - loss: 0.6411 - accuracy: 0.8250 - 135ms/epoch - 34ms/step
Epoch 6/10
4/4 - 0s - loss: 0.5806 - accuracy: 0.8083 - 135ms/epoch - 34ms/step
Epoch 7/10
4/4 - 0s - loss: 0.5745 - accuracy: 0.8417 - 123ms/epoch - 31ms/step
Epoch 8/10
4/4 - 0s - loss: 0.5944 - accuracy: 0.8083 - 144ms/epoch - 36ms/step
Epoch 9/10
4/4 - 0s - loss: 0.5024 - accuracy: 0.8583 - 111ms/epoch - 28ms/step
Epoch 10/10
4/4 - 0s - loss: 0.5539 - accuracy: 0.8750 - 102ms/epoch - 25ms/step
10827 len(unl_set)
Epoch 1/10
4/4 - 0s - loss: 0.5985 - accuracy: 0.8167 - 123ms/epoch - 31ms/step
Epoch 2/10
4/4 - 0s - loss: 0.5206 - accuracy: 0.8333 - 109ms/epoch - 27ms/step
E

In [134]:
#Just my explaination about the above section 3-1
#A higher confidence threshold means that a classifier will only teach the other classifier when it is more certain about its predictions. 
#This could lead to fewer updates and possibly higher quality updates, but may also mean the classifiers learn less from the unlabeled data.
#A lower confidence threshold allows for more frequent updates between the classifiers. 
#While this can lead to faster learning from the unlabeled data, it can also introduce noise if the predictions are not reliable.
print("A higher confidence threshold means that a classifier will only teach the other classifier when it is more certain about its predictions.\nThis could lead to fewer updates and possibly higher quality updates, but may also mean the classifiers learn less from the unlabeled data. \nA lower confidence threshold allows for more frequent updates between the classifiers. While this can lead to faster learning from the unlabeled data, it can also introduce noise if the predictions are not reliable.")

A higher confidence threshold means that a classifier will only teach the other classifier when it is more certain about its predictions.
This could lead to fewer updates and possibly higher quality updates, but may also mean the classifiers learn less from the unlabeled data. 
A lower confidence threshold allows for more frequent updates between the classifiers. While this can lead to faster learning from the unlabeled data, it can also introduce noise if the predictions are not reliable.


3-2. Change the parameters of the K-Nearest Neighbors (KNN) algorithm for Label Propagation (part 2) with the values 3, 5, and 10, and explain what you understand about these parameter adjustments.


In [130]:
# your code here
#In K-Nearest Neighbors (KNN), the parameter n_neighbors specifies the number of nearest neighbors to consider when making a classification decision. 
#In the context of label propagation, where KNN is used to infer labels for unlabeled data based on the labeled examples, changing the value of n_neighbors can significantly affect the performance and behavior of the classifier.
#In this code we are computing the accuracy of the KNN algorithm for labelled propagation for unlabelled data with
#with nearest neighbour value of 3
neigh_b1 = 3  # You can adjust this value
ac1 = label_propagation(lab_s_v1, unl_s_v1, lab_labl_set, t_s_v1, lab_t_set, neigh_b1)

#In this code we are computing the accuracy of the KNN algorithm for labelled propagation for unlabelled data with
#with nearest neighbour value of 5
neigh_b2 = 5  # You can adjust this value
ac2 = label_propagation(lab_s_v1, unl_s_v1, lab_labl_set, t_s_v1, lab_t_set, neigh_b2)

#In this code we are computing the accuracy of the KNN algorithm for labelled propagation for unlabelled data with
#with nearest neighbour value of 10
neigh_b3 = 10  # You can adjust this value
ac3 = label_propagation(lab_s_v1, unl_s_v1, lab_labl_set, t_s_v1, lab_t_set, neigh_b3)

#Prints the values of the accuracy with different KNN Neighbour values
print("The Accuracy of label propagation with KNN neighbours 3 is:", ac1)
print("The Accuracy of label propagation with KNN neighbours 5 is:", ac2)
print("The Accuracy of label propagation with KNN neighbours 10 is:", ac3)

  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)


The Accuracy of label propagation with KNN neighbours 3 is: 0.4692982456140351
The Accuracy of label propagation with KNN neighbours 5 is: 0.4692982456140351
The Accuracy of label propagation with KNN neighbours 10 is: 0.4791666666666667


In [135]:
#Here is the description of the parameter adjustments
'''
n_neighbors = 3: A smaller n_neighbors value means that the label is predicted based on a smaller, more local neighborhood of data points. It makes the classification more sensitive to the local data structure, which could be beneficial if the data is well-separated. However, it can also make the classifier more sensitive to noise in the data.
n_neighbors = 5: As we increase the number of neighbors, the algorithm starts to look at a slightly larger neighborhood. This can smooth out the predictions by considering more data points, which can sometimes improve accuracy by reducing the impact of noise. However, if the value is too high, it might include points from other classes, especially if the classes are not well-separated.
n_neighbors = 10: A larger value for n_neighbors takes into account a wider area of the data space when making a decision. This can lead to a more generalized model that may perform better when the class distribution is more uniform or when there is less noise. However, it can also dilute the influence of the closest neighbors and potentially lead to misclassification if distinct local patterns are important.
In label propagation, where you are trying to expand the labeled dataset with the knowledge gained from the unlabeled dataset, the choice of n_neighbors can be critical. A lower n_neighbors value will cause the algorithm to adhere closely to the labeled examples, potentially leading to a more conservative expansion of labels. A higher n_neighbors value can encourage broader generalization, but might also propagate labels too aggressively, leading to the spread of incorrect labels if care is not taken.
When comparing the performance of KNN with different values of n_neighbors, you are essentially comparing how well the algorithm performs with more localized vs. more generalized decision boundaries. The optimal number of neighbors depends on the specific dataset and the underlying distribution of the samples. It's often best to experiment with different values and use cross-validation to determine which setting gives the best results for your particular problem.
'''

print("n_neighbors = 3:\nA smaller n_neighbors value means that the label is predicted based on a smaller, more local neighborhood of data points. It makes the classification more sensitive to the local data structure, which could be beneficial if the data is well-separated. However, it can also make the classifier more sensitive to noise in the data.")
print("\nn_neighbors = 5: As we increase the number of neighbors, the algorithm starts to look at a slightly larger neighborhood. This can smooth out the predictions by considering more data points, which can sometimes improve accuracy by reducing the impact of noise. However, if the value is too high, it might include points from other classes, especially if the classes are not well-separated\nn_neighbors = 10: A larger value for n_neighbors takes into account a wider area of the data space when making a decision. This can lead to a more generalized model that may perform better when the class distribution is more uniform or when there is less noise. However, it can also dilute the influence of the closest neighbors and potentially lead to misclassification if distinct local patterns are important.")
print("\nIn label propagation, where you are trying to expand the labeled dataset with the knowledge gained from the unlabeled dataset, the choice of n_neighbors can be critical. A lower n_neighbors value will cause the algorithm to adhere closely to the labeled examples, potentially leading to a more conservative expansion of labels. A higher n_neighbors value can encourage broader generalization, but might also propagate labels too aggressively, leading to the spread of incorrect labels if care is not taken. When comparing the performance of KNN with different values of n_neighbors, you are essentially comparing how well the algorithm performs with more localized vs. more generalized decision boundaries. The optimal number of neighbors depends on the specific dataset and the underlying distribution of the samples. It's often best to experiment with different values and use cross-validation to determine which setting gives the best results for your particular problem.")

n_neighbors = 3:
A smaller n_neighbors value means that the label is predicted based on a smaller, more local neighborhood of data points. It makes the classification more sensitive to the local data structure, which could be beneficial if the data is well-separated. However, it can also make the classifier more sensitive to noise in the data.

n_neighbors = 5: As we increase the number of neighbors, the algorithm starts to look at a slightly larger neighborhood. This can smooth out the predictions by considering more data points, which can sometimes improve accuracy by reducing the impact of noise. However, if the value is too high, it might include points from other classes, especially if the classes are not well-separated
n_neighbors = 10: A larger value for n_neighbors takes into account a wider area of the data space when making a decision. This can lead to a more generalized model that may perform better when the class distribution is more uniform or when there is less noise. How



3-3. Let's see the impact of of a simplifies models for cotraining approach.
- Reduce the number of convolutional layers in the question 1-2 from 3 to 1 convolution layer and the rest of the layers is the same
- Change the number of trees for the random forest algorithm to 1.
Evaluate the performance of cotraining approach.
- Additionally, use the 1 layer convolution layer as the supervised model and evaluate the performance for supervised learning.


In [136]:
# your code here
#The below code has the reduced convolution layer from question 1-2 where it is reduced from 3 to 1
#All the other layers are same as the previous question 1-2
cnn_classifier = Sequential([
    Input(shape=(32, 32, 1)),
    BatchNormalization(),
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2), strides=2),
    BatchNormalization(),
    Flatten(),
    Dropout(0.1),
    Dense(16, activation='relu', kernel_regularizer=l2(0.001)),
    Dropout(0.1),
    Dense(16, activation='relu', kernel_regularizer=l2(0.001)),
    Dropout(0.1),
    Dense(num_classes, activation='softmax')
])

cnn_classifier.compile(optimizer='adam',
                       loss='sparse_categorical_crossentropy',
                       metrics=['accuracy'])


#The number of trees for the random forest algorithm is set to 1 using the parameter 'n_estimators=1'
rf_classifier = RandomForestClassifier(n_estimators=1, random_state=42)

#In this code below, I have used the 1 convolution layer above as the supervised model

lab_set = (lab_s_v1, lab_labl_set)
t_set = (t_s_v1, lab_t_set)

#The function of co-training is called along with new CNN classifier as defined above in 3-3
thresh_conf1 = 0.90
cnn_ac_1, rf_ac_1 = co_training(cnn_classifier, rf_classifier, lab_set, unl_s_v1, t_set, thresh_conf1)

#The function of supervised_training_and_accuracy is called along with new RF classifier as defined above in 3-3
supv_ac = supervised_training_and_accuracy(cnn_classifier, lab_s_v1, lab_labl_set, t_s_v1, lab_t_set)

#The accuracy is evaluated and printed for co-training model with 1 convolution layer
print("The performance of the co-training model with 1 convolution layer for CNN classifier is ",cnn_ac_1,"\nThe performance of the co-training model with 1 convolution layer for RF classifier is ",rf_ac_1)

#The accuracy is evaluated and printed for supervised training model with 1 convolution layer
print("The performance of the supervised model with 1 convolution layer:", supv_ac)

10827 len(unl_set)
Epoch 1/10
4/4 - 3s - loss: 2.2897 - accuracy: 0.2583 - 3s/epoch - 813ms/step
Epoch 2/10
4/4 - 0s - loss: 1.4321 - accuracy: 0.4833 - 92ms/epoch - 23ms/step
Epoch 3/10
4/4 - 0s - loss: 1.2597 - accuracy: 0.6250 - 99ms/epoch - 25ms/step
Epoch 4/10
4/4 - 0s - loss: 0.9140 - accuracy: 0.7000 - 100ms/epoch - 25ms/step
Epoch 5/10
4/4 - 0s - loss: 0.7201 - accuracy: 0.7500 - 101ms/epoch - 25ms/step
Epoch 6/10
4/4 - 0s - loss: 0.8177 - accuracy: 0.7917 - 101ms/epoch - 25ms/step
Epoch 7/10
4/4 - 0s - loss: 0.7442 - accuracy: 0.7833 - 100ms/epoch - 25ms/step
Epoch 8/10
4/4 - 0s - loss: 0.6807 - accuracy: 0.7750 - 101ms/epoch - 25ms/step
Epoch 9/10
4/4 - 0s - loss: 0.6606 - accuracy: 0.7333 - 101ms/epoch - 25ms/step
Epoch 10/10
4/4 - 0s - loss: 0.5948 - accuracy: 0.7750 - 85ms/epoch - 21ms/step
The performance of the co-training model with 1 convolution layer for CNN classifier is  0.6860380172729492 
The performance of the co-training model with 1 convolution layer for RF cla

  y = column_or_1d(y, warn=True)


3-4. Let's adjust the amount of labeled data in part 2 by considering two different quantities: 200 and 400 labeled data points. In each scenario, the remaining data will remain unlabeled. Evaluate the performance of label propagation under these labeled data scenarios.


In [139]:
# your code here# your code here
#The code below is for a semi-supervised learning task where a K-Nearest Neighbors classifier is used for label propagation. 
#Label propagation is a semi-supervised technique that uses both labeled and unlabeled data to train a model. 
#The model first learns from the labeled data and then infers labels for the unlabeled data based on the learned patterns.

#The first part of the code sets up the labeled and unlabeled datasets based on two scenarios: 

#one with 200 labeled data points and the other with 400 labeled data points. 
#The view1_data array is assumed to be the entire dataset, and labels_data contains the true labels for the entire dataset.

#lab_d_200: Contains the first 200 data points from view1_data, which will be used as labeled data in the first scenario.
#lab_labl_200: Contains the labels corresponding to lab_d_200.
lab_d_200 = view1_data[:200]
lab_labl_200 = labels_data[:200]

#lab_d_400: Contains the first 400 data points from view1_data, which will be used as labeled data in the second scenario.
#lab_labl_400: Contains the labels corresponding to lab_d_400.
lab_d_400 = view1_data[:400]
lab_labl_400 = labels_data[:400]

#The unlabeled datasets for each scenario are created by excluding a portion of the data based on an indexing operation. 
#intention is to exclude a certain percentage (20% indicated by 0.2) of the data from the beginning and consider them as unlabeled:
#unl_indcs_200: Represents the data points that are treated as unlabeled in the first scenario, starting from after the first 200 labeled points plus 20% of the length of view1_data.
#unl_indcs_400: Represents the data points that are treated as unlabeled in the second scenario, starting from after the first 400 labeled points plus 20% of the length of view1_data.
unl_indcs_200 = view1_data[200 + int(len(view1_data)*0.2):]
unl_indcs_400 = view1_data[400 + int(len(view1_data)*0.2):]

#t_d and lab_t_d are variables for the test data and its corresponding labels, respectively. 
#these are taken from a slice of view1_data starting from 20% of the length of view1_data to the end.
t_d = view1_data[int(len(view1_data)*0.2):]
lab_t_d = labels_data[int(len(view1_data)*0.2):]

#prints the shapes of these arrays to verify the dimensions of the data being used for each scenario.
print("The Shape of labelled data with 200 labelled points:", lab_d_200.shape)
print("The Shape of labelled labels with 200 labelled points:", lab_labl_200.shape)
print("The Shape of labelled data with 400 labelled points:", lab_d_400.shape)
print("The Shape of labelled labels with 400 labelled points:", lab_labl_400.shape)
print("The Shape of unlabelled index data with 200 labelled points:", unl_indcs_200.shape)
print("The Shape of unlabelled index with 400 labelled points:", unl_indcs_400.shape)
print("The Shape of test data:", t_d.shape)
print("The Shape of test data:", lab_t_d.shape)

#The label_propagation function is called twice: 
#once with the 200 labeled data points and their corresponding unlabeled data, 
#and once with the 400 labeled data points and their corresponding unlabeled data. 
#In both calls, n_neighbors is set to 7 which means the KNN will look at the 7 nearest neighbors to make its prediction.
#The accuracy in both scenerios is evaluated and returned by the label propagation
ac_200 = label_propagation(lab_d_200, unl_indcs_200, lab_labl_200, t_d, lab_t_d, n_neighbors=7)
ac_400 = label_propagation(lab_d_400, unl_indcs_400, lab_labl_400, t_d, lab_t_d, n_neighbors=7)

#accuracy of label propagation for each scenario is printed here
print("Accuracy of the labelled propagarion with 200 labelled points is ", ac_200)
print("Accuracy of the labelled propagarion with 400 labelled points is ", ac_400)

The Shape of labelled data with 200 labelled points: (200, 32, 32)
The Shape of labelled labels with 200 labelled points: (200, 1)
The Shape of labelled data with 400 labelled points: (400, 32, 32)
The Shape of labelled labels with 400 labelled points: (400, 1)
The Shape of unlabelled index data with 200 labelled points: (10747, 32, 32)
The Shape of unlabelled index with 400 labelled points: (10547, 32, 32)
The Shape of test data: (10947, 32, 32)
The Shape of test data: (10947, 1)


  return self._fit(X, y)
  return self._fit(X, y)


Accuracy of the labelled propagarion with 200 labelled points is  0.3153375353978259
Accuracy of the labelled propagarion with 400 labelled points is  0.2935050698821595



3-5. Let's adjust the number of labeled data samples for part 1. Consider three scenarios: one with 200 labeled samples, another with 400 labeled samples, and a third with 600 labeled samples. In each scenario, the remaining data will remain unlabeled. Additionally, include an explanation of your understanding of how these parameter changes impact the algorithm


In [142]:
#The provided code is for a semi-supervised learning experiment where the performance of a Co-Training algorithm and a supervised Random Forest (RF) classifier is evaluated across different amounts of labeled data. 
#Co-Training is a semi-supervised learning approach that leverages two different models (in this case, a Convolutional Neural Network (CNN) and an RF classifier) that teach each other from the unlabeled data, using their high-confidence predictions as pseudo-labels.

#Labeled and unlabeled datasets are prepared for three scenarios with different amounts of labeled data: 200, 400, and 600 samples. 
#The rest of the data points are considered unlabeled.

#labeled data: with 200 labelled data points
lab_d_200 = view1_data[:200]
lab_labl_200 = labels_data[:200]

#labeled data: with 400 labelled data points
lab_d_400 = view1_data[:400]
lab_labl_400 = labels_data[:400]

#labeled data: with 600 labelled data points
lab_d_600 = view1_data[:600]
lab_labl_600 = labels_data[:600]

#Remaining unlabelled data: 
unl_indcs_200 = view1_data[200 + int(len(view1_data)*0.2):]
unl_indcs_400 = view1_data[400 + int(len(view1_data)*0.2):]
unl_indcs_600 = view1_data[600 + int(len(view1_data)*0.2):]

#Test data and its labels
t_d = view1_data[int(len(view1_data)*0.2):]
lab_t_d = labels_data[int(len(view1_data)*0.2):]

#A threshold confidence (thresh_confd) of 0.90 is set for the Co-Training algorithm. 
#This threshold is used by the Co-Training process to determine which predictions are confident enough to be used as pseudo-labels.
thresh_confd = 0.90

#200 Labeled samples
lab_s_200 = (lab_d_200, lab_labl_200)
t_set = (t_d, lab_t_d)

#400 Labeled samples
lab_s_400 = (lab_d_400, lab_labl_400)
t_set = (t_d, lab_t_d)

#600 Labeled samples
lab_s_600 = (lab_d_600, lab_labl_600)
t_set = (t_d, lab_t_d)

#Evaluate the co-training for 200 labels
cnn_ac200, rf_ac200 = co_training(cnn_classifier, rf_classifier, lab_s_200, unl_indcs_200, t_set, thresh_confd)

#Evaluate the supervised training for 200 labels
rf_sup_ac200 = supervised_training_and_accuracy(rf_classifier, lab_d_200, lab_labl_200, t_d, lab_t_d)

#Evaluate the co-training for 400 labels
cnn_ac400, rf_ac400 = co_training(cnn_classifier, rf_classifier, lab_s_400, unl_indcs_400, t_set, thresh_confd)

#Evaluate the supervised training for 400 labels
rf_sup_ac400 = supervised_training_and_accuracy(rf_classifier, lab_d_400, lab_labl_400, t_d, lab_t_d)

#Evaluate the co-training for 600 labels
cnn_ac600, rf_ac600 = co_training(cnn_classifier, rf_classifier, lab_s_600, unl_indcs_600, t_set, thresh_confd)

#Evaluate the supervised training for 400 labels
rf_sup_ac600 = supervised_training_and_accuracy(rf_classifier, lab_d_600, lab_labl_600, t_d, lab_t_d)

print("Accuracy of the CNN co-training with 200 labelled points is", cnn_ac200,"\nand RF co-training with 200 labelled points is ",rf_ac200)
print("Accuracy of supervised training with 200 labelled points is", rf_sup_ac200)
print("Accuracy of the CNN co-training with 400 labelled points is", cnn_ac400,"\nand RF co-training with 400 labelled points is ",rf_ac400)
print("Accuracy of supervised training with 400 labelled points is", rf_sup_ac400)
print("Accuracy of the CNN co-training with 600 labelled points is", cnn_ac600,"\nand RF co-training with 600 labelled points is ",rf_ac600)
print("Accuracy of supervised training with 600 labelled points is",rf_sup_ac600)


10747 len(unl_set)
Epoch 1/10
7/7 - 0s - loss: 0.1968 - accuracy: 0.9500 - 105ms/epoch - 15ms/step
Epoch 2/10
7/7 - 0s - loss: 0.1974 - accuracy: 0.9500 - 134ms/epoch - 19ms/step
Epoch 3/10
7/7 - 0s - loss: 0.1969 - accuracy: 0.9500 - 133ms/epoch - 19ms/step
Epoch 4/10
7/7 - 0s - loss: 0.1997 - accuracy: 0.9500 - 100ms/epoch - 14ms/step
Epoch 5/10
7/7 - 0s - loss: 0.1976 - accuracy: 0.9500 - 135ms/epoch - 19ms/step
Epoch 6/10
7/7 - 0s - loss: 0.1937 - accuracy: 0.9500 - 170ms/epoch - 24ms/step
Epoch 7/10
7/7 - 0s - loss: 0.1935 - accuracy: 0.9500 - 182ms/epoch - 26ms/step
Epoch 8/10
7/7 - 0s - loss: 0.1937 - accuracy: 0.9500 - 153ms/epoch - 22ms/step
Epoch 9/10
7/7 - 0s - loss: 0.1916 - accuracy: 0.9500 - 178ms/epoch - 25ms/step
Epoch 10/10
7/7 - 0s - loss: 0.1936 - accuracy: 0.9500 - 175ms/epoch - 25ms/step


  y = column_or_1d(y, warn=True)


10547 len(unl_set)
Epoch 1/10
13/13 - 0s - loss: 0.1795 - accuracy: 0.9750 - 315ms/epoch - 24ms/step
Epoch 2/10
13/13 - 0s - loss: 0.1767 - accuracy: 0.9750 - 325ms/epoch - 25ms/step
Epoch 3/10
13/13 - 0s - loss: 0.1768 - accuracy: 0.9750 - 331ms/epoch - 25ms/step
Epoch 4/10
13/13 - 0s - loss: 0.1797 - accuracy: 0.9725 - 327ms/epoch - 25ms/step
Epoch 5/10
13/13 - 0s - loss: 0.1705 - accuracy: 0.9750 - 348ms/epoch - 27ms/step
Epoch 6/10
13/13 - 0s - loss: 0.1722 - accuracy: 0.9725 - 329ms/epoch - 25ms/step
Epoch 7/10
13/13 - 0s - loss: 0.1671 - accuracy: 0.9750 - 323ms/epoch - 25ms/step
Epoch 8/10
13/13 - 0s - loss: 0.1665 - accuracy: 0.9750 - 315ms/epoch - 24ms/step
Epoch 9/10
13/13 - 0s - loss: 0.1648 - accuracy: 0.9750 - 328ms/epoch - 25ms/step
Epoch 10/10
13/13 - 0s - loss: 0.1637 - accuracy: 0.9750 - 334ms/epoch - 26ms/step


  y = column_or_1d(y, warn=True)


10347 len(unl_set)
Epoch 1/10
19/19 - 0s - loss: 0.1659 - accuracy: 0.9683 - 344ms/epoch - 18ms/step
Epoch 2/10
19/19 - 0s - loss: 0.1697 - accuracy: 0.9667 - 427ms/epoch - 22ms/step
Epoch 3/10
19/19 - 0s - loss: 0.1604 - accuracy: 0.9683 - 463ms/epoch - 24ms/step
Epoch 4/10
19/19 - 0s - loss: 0.1587 - accuracy: 0.9683 - 484ms/epoch - 25ms/step
Epoch 5/10
19/19 - 0s - loss: 0.1582 - accuracy: 0.9683 - 472ms/epoch - 25ms/step
Epoch 6/10
19/19 - 0s - loss: 0.1568 - accuracy: 0.9683 - 494ms/epoch - 26ms/step
Epoch 7/10
19/19 - 1s - loss: 0.1538 - accuracy: 0.9683 - 699ms/epoch - 37ms/step
Epoch 8/10
19/19 - 0s - loss: 0.1536 - accuracy: 0.9683 - 473ms/epoch - 25ms/step
Epoch 9/10
19/19 - 0s - loss: 0.1504 - accuracy: 0.9683 - 492ms/epoch - 26ms/step
Epoch 10/10
19/19 - 0s - loss: 0.1479 - accuracy: 0.9683 - 457ms/epoch - 24ms/step


  y = column_or_1d(y, warn=True)


Accuracy of the CNN co-training with 200 labelled points is 0.5598794221878052 
and RF co-training with 200 labelled points is  0.6041837946469353
Accuracy of supervised training with 200 labelled points is 0.34237690691513656
Accuracy of the CNN co-training with 400 labelled points is 0.5598794221878052 
and RF co-training with 400 labelled points is  0.6041837946469353
Accuracy of supervised training with 400 labelled points is 0.34237690691513656
Accuracy of the CNN co-training with 600 labelled points is 0.5598794221878052 
and RF co-training with 600 labelled points is  0.6041837946469353
Accuracy of supervised training with 600 labelled points is 0.34237690691513656



3-6. Evalute the perfomance for different number of unlabeled data size.
- Set labeled data size within the range of 100 to 130 and
- Set the unlabeled data sizes at 200, 400, and 600.
- Execute the algorithms and provide accuracy reports for both approaches: co-training and label propagation.


In [145]:
#evaluate the performance of two semi-supervised learning approaches, co-training and label propagation
#with a fixed size of labeled data and varying sizes of unlabeled data. 

#A fixed number of samples (120) from view1_data are selected as the labeled dataset
lab_d = view1_data[:120]

#along with their corresponding labels from labels_data.
lab_labl = labels_data[:120]

#Three different subsets of view1_data are created for unlabeled data, containing 200, 400, and 600 samples respectively
ubl_indcs_200 = view1_data[121:321]
ubl_indcs_400 = view1_data[121:521]
ubl_indcs_600 = view1_data[121:721]

#The remaining parts of view1_data are used as test datasets, along with their corresponding labels.
#to evaluate the performance of the algorithms. 
#Each test dataset corresponds to one of the unlabeled subsets, with test data coming after the corresponding unlabeled data.
t_d_200 = view1_data[321:]
lab_t_d_200 = labels_data[321:]

t_d_400 = view1_data[521:]
lab_t_d_400 = labels_data[521:]

t_d_600 = view1_data[721:]
lab_t_d_600 = labels_data[721:]

#The co_training function is called three times, each with the labeled dataset and one of the three unlabeled datasets. 
#This function applies the co-training algorithm using a CNN and an RF classifier. 
#The confidence threshold for pseudo-labeling is set to 0.90
thresh_confdt = 0.90

#200 Labeled samples
lab_set = (lab_d, lab_labl)
t_s_200 = (t_d_200, lab_t_d_200)

#Evaluate the co-training for 200 labels
cnn_ac_200, rf_ac_200 = co_training(cnn_classifier, rf_classifier, lab_set, ubl_indcs_200, t_s_200, thresh_confdt)

#Evaluate the supervised training for 200 labels
rf_super_ac_200 = supervised_training_and_accuracy(rf_classifier, lab_d, lab_labl, t_d_200, lab_t_d_200)

#400 Labeled samples
lab_set = (lab_d, lab_labl)
t_s_400 = (t_d_400, lab_t_d_400)

#Evaluate the co-training for 400 labels
cnn_ac_400, rf_ac_400 = co_training(cnn_classifier, rf_classifier, lab_set, ubl_indcs_400, t_s_400, thresh_confdt)

#Evaluate the supervised training for 400 labels
rf_super_ac_400 = supervised_training_and_accuracy(rf_classifier, lab_d, lab_labl, t_d_400, lab_t_d_400)

#600 Labeled samples
lab_set = (lab_d, lab_labl)
t_s_600 = (t_d_600, lab_t_d_600)

#Evaluate the co-training for 600 labels
cnn_ac_600, rf_ac_600 = co_training(cnn_classifier, rf_classifier, lab_set, ubl_indcs_600, t_s_600, thresh_confdt)

#Evaluate the supervised training for 600 labels
rf_super_ac_600 = supervised_training_and_accuracy(rf_classifier, lab_d, lab_labl, t_d_600, lab_t_d_600)

#Evaluate the label propagation for three different sets of unlabelled data.
ac_200 = label_propagation(lab_d, ubl_indcs_200, lab_labl, t_d_200, lab_t_d_200, n_neighbors=7)
ac_400 = label_propagation(lab_d, ubl_indcs_400, lab_labl, t_d_400, lab_t_d_400, n_neighbors=7)
ac_600 = label_propagation(lab_d, ubl_indcs_600, lab_labl, t_d_600, lab_t_d_600, n_neighbors=7)

#Accuracy reports for both approaches: co-training and label propagation.
print("Here is the report of accuracies for both co-training and label propagation approaches:\n")
print("Accuracy of the CNN co-training with unlabelled data size 200:", cnn_ac_200,"\nAccuracy of RF co-training with unlabelled data size 200: ",rf_ac_200)
print("Accuracy of supervised training with unlabelled data size 200:", rf_super_ac_200)
print("Accuracy of the CNN co-training with unlabelled data size 400:", cnn_ac_400,"\nAccuracy of RF co-training with unlabelled data size 400: ",rf_ac_400)
print("Accuracy of supervised training with unlabelled data size 400:", rf_super_ac_400)
print("Accuracy of the CNN co-training with unlabelled data size 600:", cnn_ac_600,"\nAccuracy of RF co-training with unlabelled data size 600: ",rf_ac_600)
print("Accuracy of supervised training with unlabelled data size 600:",rf_super_ac_600)
print("Accuracy of labelled propagation model with unlabelled data size 200:",ac_200)
print("Accuracy of labelled propagation model with unlabelled data size 400:",ac_400)
print("Accuracy of labelled propagation model with unlabelled data size 600:",ac_600)

print("\nBy comparing the accuracies, I have analyzed how the amount of unlabeled data impacts the performance of both semi-supervised learning approaches.\nGenerally, having more unlabeled data allows for more learning from unlabeled instances, which can lead to better performance, especially for algorithms designed to leverage large amounts of unlabeled data effectively.\nHowever, the effectiveness can also depend on the quality of the unlabeled data and how representative it is of the overall data distribution.\nMore unlabeled data does not automatically guarantee better performance if that data is noisy or if the algorithms do not have mechanisms to deal with potentially misleading information.")

200 len(unl_set)
Epoch 1/10
4/4 - 0s - loss: 0.1391 - accuracy: 0.9583 - 99ms/epoch - 25ms/step
Epoch 2/10
4/4 - 0s - loss: 0.1422 - accuracy: 0.9583 - 86ms/epoch - 21ms/step
Epoch 3/10
4/4 - 0s - loss: 0.1385 - accuracy: 0.9583 - 118ms/epoch - 30ms/step
Epoch 4/10
4/4 - 0s - loss: 0.1366 - accuracy: 0.9583 - 111ms/epoch - 28ms/step
Epoch 5/10
4/4 - 0s - loss: 0.1387 - accuracy: 0.9583 - 115ms/epoch - 29ms/step
Epoch 6/10
4/4 - 0s - loss: 0.1370 - accuracy: 0.9583 - 100ms/epoch - 25ms/step
Epoch 7/10
4/4 - 0s - loss: 0.1383 - accuracy: 0.9583 - 107ms/epoch - 27ms/step
Epoch 8/10
4/4 - 0s - loss: 0.1355 - accuracy: 0.9583 - 118ms/epoch - 29ms/step
Epoch 9/10
4/4 - 0s - loss: 0.1365 - accuracy: 0.9583 - 87ms/epoch - 22ms/step
Epoch 10/10
4/4 - 0s - loss: 0.1359 - accuracy: 0.9583 - 116ms/epoch - 29ms/step


  y = column_or_1d(y, warn=True)


400 len(unl_set)
Epoch 1/10
4/4 - 0s - loss: 0.1351 - accuracy: 0.9583 - 103ms/epoch - 26ms/step
Epoch 2/10
4/4 - 0s - loss: 0.1345 - accuracy: 0.9583 - 84ms/epoch - 21ms/step
Epoch 3/10
4/4 - 0s - loss: 0.1395 - accuracy: 0.9583 - 103ms/epoch - 26ms/step
Epoch 4/10
4/4 - 0s - loss: 0.1382 - accuracy: 0.9583 - 114ms/epoch - 28ms/step
Epoch 5/10
4/4 - 0s - loss: 0.1379 - accuracy: 0.9583 - 118ms/epoch - 29ms/step
Epoch 6/10
4/4 - 0s - loss: 0.1359 - accuracy: 0.9583 - 115ms/epoch - 29ms/step
Epoch 7/10
4/4 - 0s - loss: 0.1305 - accuracy: 0.9583 - 100ms/epoch - 25ms/step
Epoch 8/10
4/4 - 0s - loss: 0.1341 - accuracy: 0.9583 - 98ms/epoch - 24ms/step
Epoch 9/10
4/4 - 0s - loss: 0.1363 - accuracy: 0.9583 - 112ms/epoch - 28ms/step
Epoch 10/10
4/4 - 0s - loss: 0.1322 - accuracy: 0.9583 - 102ms/epoch - 25ms/step


  y = column_or_1d(y, warn=True)


600 len(unl_set)
Epoch 1/10
4/4 - 0s - loss: 0.1291 - accuracy: 0.9583 - 62ms/epoch - 16ms/step
Epoch 2/10
4/4 - 0s - loss: 0.1289 - accuracy: 0.9583 - 61ms/epoch - 15ms/step
Epoch 3/10
4/4 - 0s - loss: 0.1347 - accuracy: 0.9583 - 64ms/epoch - 16ms/step
Epoch 4/10
4/4 - 0s - loss: 0.1309 - accuracy: 0.9583 - 80ms/epoch - 20ms/step
Epoch 5/10
4/4 - 0s - loss: 0.1294 - accuracy: 0.9583 - 84ms/epoch - 21ms/step
Epoch 6/10
4/4 - 0s - loss: 0.1282 - accuracy: 0.9583 - 64ms/epoch - 16ms/step
Epoch 7/10
4/4 - 0s - loss: 0.1342 - accuracy: 0.9583 - 78ms/epoch - 19ms/step
Epoch 8/10
4/4 - 0s - loss: 0.1306 - accuracy: 0.9583 - 69ms/epoch - 17ms/step
Epoch 9/10
4/4 - 0s - loss: 0.1360 - accuracy: 0.9583 - 64ms/epoch - 16ms/step
Epoch 10/10
4/4 - 0s - loss: 0.1296 - accuracy: 0.9583 - 55ms/epoch - 14ms/step


  y = column_or_1d(y, warn=True)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)


Here is the report of accuracies for both co-training and label propagation approaches:

Accuracy of the CNN co-training with unlabelled data size 200: 0.724816620349884 
Accuracy of RF co-training with unlabelled data size 200:  0.6700344259841341
Accuracy of supervised training with unlabelled data size 200: 0.3580302349947613
Accuracy of the CNN co-training with unlabelled data size 400: 0.7212429642677307 
Accuracy of RF co-training with unlabelled data size 400:  0.6656283239629236
Accuracy of supervised training with unlabelled data size 400: 0.3488831484576812
Accuracy of the CNN co-training with unlabelled data size 600: 0.6347785592079163 
Accuracy of RF co-training with unlabelled data size 600:  0.6605462120043203
Accuracy of supervised training with unlabelled data size 600: 0.33891374787841383
Accuracy of labelled propagation model with unlabelled data size 200: 0.5571022302050591
Accuracy of labelled propagation model with unlabelled data size 400: 0.5575140556146483
Accu

<h2>Submission</h2>

<hr style="border-top: 5px solid orange; margin-top: 1px; margin-bottom: 1px"></hr>

<p style="text-align: justify;">You need to submit a Jupyter Notebook (*.ipynb) file that contains your completed code.


<span>The file name should be in <strong>FirstName_LastName</strong> format</span>.</p>
<p style="text-align: justify;"><span>DO NOT INCLUDE EXTRA FILES, SUCH AS THE INPUT DATASETS</span>, in your submission;</p>
<p style="text-align: justify;">Please download your assignment after submission and make sure it is not corrupted or empty! We will not be responsible for corrupted submissions and will not take a resubmission after the deadline.</p>

Need Help?
If you need help with this assignment, please get in touch with TAs via their emails, or go to their office hours.
You are highly encouraged to ask your question on the designated channel for Assignment o on Microsoft Teams (not necessarily monitored by the instructor/TAs). Feel free to help other students with general questions. However, DO NOT share your solution.<hr style="border-top: 5px solid orange; margin-top: 1px; margin-bottom: 1px"></hr>