<a href="https://colab.research.google.com/github/HardikPaliwal/CS484Proj/blob/master/cs484%20proj.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **CS484 Final Project**

#### Topic 6: Weakly supervised classification

#### Hardik Paliwal (20725413), Lance Pereira (20719626)

______________________________________________________________

#**Table Of Contents**
- A) Abstract
- B) High Level Goals and Methedology
- C) Team Members and Contributions
- D) External Code Libraries
- E) Code
  - 1) Setup (imports, loading data)
  - 2) Define our Base CNN (Based on VGG11), Test, Train methods
  - 3) Split Fashion MNIST training data into M labeled images, and N-M unlabeled
  - 4) Train Base CNN on M labeled training images
  - 5) Use the 'semi' trained CNN to gain important features (feature embedings) of N-M *unlabeled* training images
  - 6) Match cluster labels to actual Fashion MNIST labels
  - 7) Define retrained CNN function using predicted labels from clustering for unlabeled data
- F) Experiments
  - 1) Diffirent ratios of M/N
  - 2) Diffirent Clustering Methods
- G) Results
- H) Conclusions

#**A) Abstract:**

For our project we have decided to do Project 6, choosing specifically Fashion MNIST. We use weakly supervised classification using feature extraction (embedings similar to a PCA) along with clustering methods to try and improve results compared to just using supervised learning.
 

#**B) High Level Goals and Methedology:**

Out high level goal is to use weakly supervised classification to improve our 
prediction ability compared to just training using labeled images. We hope to 
achieve atleast a 5% increase in our test prediction score using clustering 
methods such as Kmeans and Mini-Batch Kmeans. Our attempts to use other clustering methods were stopped by our limited RAM capacity, but Kmeans and Mini-Batch Kmeans are quite effective.

Our method is split into two experiments, 
- we first will test to see training
a CNN on M labeled images, then use a cluestering method to classify all the 
N images, using the majority label in each cluster as the predicted label for the unlabeled N-M images
- secondly we will try using the predicted labels from the previous step to 
train a new CNN model, to see if it performs better than the original CNN model trained on just the M labeled images

Our baseline will be a simple CNN  trained on the M labeled images. We will use this to see if our weakly supervised method improved the accuracy rate. 

#**C) Team Members and Contributions**

- Hardik Paliwal
  - Created CNN based on VGG11 for training
  - Created function to gather features (feature embedings) from pretrained CNN (trained on M labeled images)
  - Created method to get predicted labels from clustering methods
  - Created train,test methods

- Lance Pereira
  - Create function to retrain CNN using clustered labels
  - Modified clustering prediction methods to use only labeled training data
  - Created experiments for diffirent ratios of M:M-N
  - Created function to split data into labeled and unlabeled
  - Created experiments for diffirent clustering methods
    - Mini Batch Kmeans
    - Kmeans
  - Wrote/formated report



#**D) External Code Libraries**

We used

- Pytorch
  - Because it was crucial for the quick training of our models
  - Allowed us to not have to deal with calculating back propogation
- Sklearn
  - Provided us a large range of clustering methods for quick experimentation
- Numpy
  - Useful for large matrix operations

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision as tv
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader
from torch.autograd import Variable
import sklearn
from torch import optim
import numpy as np
import itertools

%matplotlib inline

# **E) Code**

### **E) [1] Code: Setup**

The below cells:
- Import Fshion MNIST
- define functions that strip label from data

In [2]:
# Constants
dev=torch.device("cuda") 
NUM_EPOCHS = 6
NUM_CLUSTERS = 30
NUM_CLASSES = 10
UNKNOWN_CLASS = 11

In [3]:
trainset = tv.datasets.FashionMNIST(root="./", download=True,train=True,  transform=tv.transforms.Compose(
    [tv.transforms.Resize(32), tv.transforms.ToTensor()]))
trainloader = DataLoader(trainset, batch_size=128, shuffle=True)

testset = tv.datasets.FashionMNIST(root="./", download=True,train=False,  transform=tv.transforms.Compose(
    [tv.transforms.Resize(32), tv.transforms.ToTensor()]))
testloader = DataLoader(testset, batch_size=128, shuffle=True)

### **E) [2] Code: Define our Base CNN (Based on VGG11), Test, Train methods**

In [4]:
#This is an implementation of VGG11 (which is a precursor to VGG16) for mnist dataset.
# it also takes in n, which is the number of classes. N+1 class stands for unknown. 

#this will let us differeniate the unlabelled data from the labelled data
class BasicNet(nn.Module):
    def __init__(self, n=9):
        super(BasicNet, self).__init__()
        self.batchNorm = [nn.BatchNorm2d(64), nn.BatchNorm2d(128),nn.BatchNorm2d(256), nn.BatchNorm2d(256),
                          nn.BatchNorm2d(512), nn.BatchNorm2d(512), nn.BatchNorm2d(512), nn.BatchNorm2d(512)]
        self.conv = [
        nn.Conv2d(1, 64, 3, 1, 1) ,nn.Conv2d(64, 128, 3, 1, 1), nn.Conv2d(128, 256, 3, 1, 1), nn.Conv2d(256, 256, 3, 1, 1)
       ,nn.Conv2d(256, 512, 3, 1, 1), nn.Conv2d(512, 512, 3, 1, 1), nn.Conv2d(512, 512, 3, 1, 1), nn.Conv2d(512, 512, 3, 1, 1)
        ]
        maxPool = nn.MaxPool2d(2, stride=2)
        self.conv1 = nn.Sequential(self.conv[0], self.batchNorm[0], nn.ReLU(), maxPool)
        self.conv2 = nn.Sequential(self.conv[1], self.batchNorm[1], nn.ReLU(), maxPool)
        self.conv3 = nn.Sequential(self.conv[2], self.batchNorm[2], nn.ReLU())
        self.conv4 = nn.Sequential(self.conv[3], self.batchNorm[3], nn.ReLU(), maxPool) 
        self.conv5 = nn.Sequential(self.conv[4], self.batchNorm[4], nn.ReLU())
        self.conv6 = nn.Sequential(self.conv[5], self.batchNorm[5], nn.ReLU(), maxPool)
        self.conv7 = nn.Sequential(self.conv[6], self.batchNorm[6], nn.ReLU()) 
        self.conv8 = nn.Sequential(self.conv[7], self.batchNorm[7], nn.ReLU(), maxPool)
        self.fc1 = nn.Linear(512, 4096)
        self.fc2 = nn.Linear(4096, 4096)
        self.fc3 = nn.Linear(4096, n+1)
        
    def forward(self, x, feature_embedding=False):
        dropOut = nn.Dropout(p=0.5)

        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        x = self.conv5(x)
        x = self.conv6(x)
        x = self.conv7(x)
        x = self.conv8(x)
        x = torch.flatten(x, 1)
        
        x = dropOut(F.relu(self.fc1(x)))
        x = dropOut(F.relu(self.fc2(x)))
        if(feature_embedding):
          return x
        #Not sure why this works without softmax. probably a reasoning givin in the paper (cause the output can range from anything (not normalized to a 0-1 probability range))
        x = self.fc3(x)
        return x

In [5]:
def test(data, net):
    net.eval()
    loss_func = nn.CrossEntropyLoss()

    total_correct = 0
    total_loss = 0
    with torch.no_grad():
        correct = 0
        total = 0
        for i, (images, labels) in enumerate(data):
            images= images.to(dev)
            labels = labels.to(dev)
            test_pred = net(images)

            pred = torch.max(test_pred, 1)[1].data.squeeze()
            total_correct+= (pred == labels).sum().item()
            loss = loss_func( test_pred, labels)
            total_loss+= loss.item()*images.size(0)
        # return total_correct/len(data.dataset), total_loss/len(data.dataset)
        # Note From Lance: We shouldn't divide the total loss
        return total_correct/len(data.dataset), total_loss
  
def train(num_epochs, net, trainloader):
    optimizer = optim.SGD(net.parameters(), lr=0.01)
    loss_func = nn.CrossEntropyLoss()

    accuracy_through_epochs = []
    total_step = len(trainloader)
    
    for epoch in range(num_epochs):
        net.train()
        for i, (images, labels) in enumerate(trainloader):
            images= images.to(dev)
            labels = labels.to(dev)
            optimizer.zero_grad()           
            prediction = net(images)
            loss = loss_func( prediction, labels)
            loss.backward()
            optimizer.step()
            if ((i +1) % 100 == 0):
                print(f"Epoch {epoch+1} / {num_epochs}, Step {i+1}/ {total_step} , Loss {loss.item()}")

    return accuracy_through_epochs, net

### **E) [3] Code: Split Fashion MNIST training data into M labeled images, and N-M unlabeled**

In [6]:
#Modifies dataset in place to only have values correspounding to the labels in classesToUse
def splitTrainingData(training_data, M_percent):
  # N is len(training_data)
  len_N = len(trainset)

  # M is the number of labeled images we want
  len_M = int(M_percent*len_N)
  
  labeled_data, unlabeled_data = torch.utils.data.random_split(trainset, [len_M, len_N - len_M])

  # strip the labels from unlabeled_data
  # unlabeled_data.dataset.targets[unlabeled_data.indices] = UNKNOWN_CLASS

  labeled_data_loader = DataLoader(labeled_data, batch_size=128, shuffle=True)
  unlabeled_data_loader = DataLoader(unlabeled_data, batch_size=128, shuffle=True)

  return labeled_data_loader, unlabeled_data_loader, labeled_data, unlabeled_data


In [7]:
labeled_train_loader, unlabeled_train_loader, labeled_train_data, unlabeled_train_data = splitTrainingData(trainset, 0.7)

### **E) [4] Code: Train Base CNN on M labeled training images**

In [8]:
net = BasicNet()
net.to(dev)
result, trained_net = train(NUM_EPOCHS, net, labeled_train_loader)

Epoch 1 / 6, Step 100/ 329 , Loss 0.6301531195640564
Epoch 1 / 6, Step 200/ 329 , Loss 0.4433915317058563
Epoch 1 / 6, Step 300/ 329 , Loss 0.30672404170036316
Epoch 2 / 6, Step 100/ 329 , Loss 0.30975407361984253
Epoch 2 / 6, Step 200/ 329 , Loss 0.2942114472389221
Epoch 2 / 6, Step 300/ 329 , Loss 0.14940808713436127
Epoch 3 / 6, Step 100/ 329 , Loss 0.2286154180765152
Epoch 3 / 6, Step 200/ 329 , Loss 0.2465631663799286
Epoch 3 / 6, Step 300/ 329 , Loss 0.3027246594429016
Epoch 4 / 6, Step 100/ 329 , Loss 0.12336543202400208
Epoch 4 / 6, Step 200/ 329 , Loss 0.1637580841779709
Epoch 4 / 6, Step 300/ 329 , Loss 0.21599192917346954
Epoch 5 / 6, Step 100/ 329 , Loss 0.12565259635448456
Epoch 5 / 6, Step 200/ 329 , Loss 0.2050739824771881
Epoch 5 / 6, Step 300/ 329 , Loss 0.12265955656766891
Epoch 6 / 6, Step 100/ 329 , Loss 0.11220794171094894
Epoch 6 / 6, Step 200/ 329 , Loss 0.10667518526315689
Epoch 6 / 6, Step 300/ 329 , Loss 0.18989360332489014


### **E) [5] Code: Use the 'semi' trained CNN to gain important features (feature embedings) of N-M *unlabeled* training images**

In [9]:
#do this to store the results in 1 numpy array of 10000 images vs like 20 batches of size 128 images. 
#get memory error when doing it on a batch of size len(trainloader), so we have to combine the results for trainloader

def getFeatureEmbedings(dataloader):
  featureEmbed = []
  predictedDigit = []

  with torch.no_grad():
    for i, (images, labels) in enumerate(dataloader):
        images= images.to(dev)
        labels = labels.to(dev)
        featureEmbed.append(net(images, feature_embedding=True).to("cpu").numpy())
        pred = net(images)
        predictedDigit.append(torch.max(pred, 1)[1].data.squeeze().to("cpu").numpy())

  # flatten lists
  featureEmbed = np.array(list(itertools.chain(*featureEmbed)))
  predictedDigit = np.array(list(itertools.chain(*predictedDigit)))

  return featureEmbed, predictedDigit

In [10]:
# We need the data with the labels
trainFeatureEmbed, trainPredictedDigit = getFeatureEmbedings(trainloader)

### **E) [6] Code: Match cluster labels to actual Fashion MNIST labels**

In [11]:
#In order to see how well k-means did we can use this supervised method of defining what a cluster is by seting the cluster label as the most common digits in that cluster
#unsupervised approaches include: manually selecting class depending on mean image
def retrieve_cluster_to_classification(cluster_labels,y_train):
  reference_labels = {}
# For loop to run through each label of cluster label
  for i in range(len(np.unique(kmeans.labels_))):
    index = np.where(cluster_labels == i,1,0)
    # we only read 0:NUM_CLASSES so we dont read the unknown labels
    num = np.bincount(y_train[index==1])[:NUM_CLASSES].argmax()
    reference_labels[i] = num
    # TODO: Right now the refrence label just maps to the majority label, should we also consider the 2nd and 3rd highest
  return reference_labels

### **E) [7] Code: Define retrained CNN function using predicted labels from clustering for unlabeled data**



In [12]:
def retrain_CNN_with_predicted_labels(train_dataset, my_labeled_trainset, my_unlabeled_trainset, unlabeled_train_loader, cluster_method):
  
  # remove all unlabeled labels from main training set, set them to NUM_CLASSES
  train_targets = np.array(train_dataset.targets.numpy(), copy=True)  
  train_targets[my_unlabeled_trainset.indices] = NUM_CLASSES
  
  # mapping from NUM_CLUSTERS to FASHION_MNIST classes
  reference_labels = retrieve_cluster_to_classification(
      cluster_method.labels_, 
      train_targets
  )

  unlabelTrainFeatureEmbed, _ = getFeatureEmbedings(unlabeled_train_loader)
  
  # Assign new labels
  predicted_test = cluster_method.predict(unlabelTrainFeatureEmbed)
  
  for i in range(unlabelTrainFeatureEmbed.shape[0]):
    my_unlabeled_trainset.dataset.targets[my_unlabeled_trainset.indices[i]] = reference_labels[predicted_test[i]]

  new_combined_data = torch.utils.data.ConcatDataset([my_labeled_trainset, my_unlabeled_trainset])
  new_combined_data_loader = DataLoader(new_combined_data, batch_size=128, shuffle=True)
  retrained_net = BasicNet()
  retrained_net.to(dev)
  result, retrained_net = train(NUM_EPOCHS, retrained_net, new_combined_data_loader)
  return retrained_net



#**Experiments**


###Experiment A) First we will create three experiments using diffirent ratios of M:N-M


1.   Ratio of 70% labeled to 30% unlabeled
2.   Ratio of 50% labeled to 50% unlabeled
3.   Ratio of 30% labeled to 70% unlabeled




In [13]:
# Experiment A: Diffirent Ratios of M to N

labeled_train_loader_70, unlabeled_train_loader_30, labeled_train_data_70, unlabeled_train_data_30 = splitTrainingData(trainset, 0.7)
labeled_train_loader_50, unlabeled_train_loader_50, labeled_train_data_50, unlabeled_train_data_50 = splitTrainingData(trainset, 0.5)
labeled_train_loader_30, unlabeled_train_loader_70, labeled_train_data_30, unlabeled_train_data_70 = splitTrainingData(trainset, 0.3)

###Experiment B) Then we will try training it with diffirent clustering methods

In [14]:
from sklearn.cluster import MiniBatchKMeans

kmeans = MiniBatchKMeans(n_clusters = NUM_CLUSTERS)

kmeans.fit(trainFeatureEmbed)

MiniBatchKMeans(n_clusters=30)

In [15]:
from sklearn.cluster import KMeans

gmm = KMeans(n_clusters = NUM_CLUSTERS)

gmm.fit(trainFeatureEmbed)

KMeans(n_clusters=30)

###Experiment C) Train all the models combining diffirent experiments

In [16]:
retrained_net_kmeans_70_label_30_not = retrain_CNN_with_predicted_labels(trainset, labeled_train_data_70, unlabeled_train_data_30, unlabeled_train_loader_30, kmeans)

Epoch 1 / 6, Step 100/ 469 , Loss 1.4785804748535156
Epoch 1 / 6, Step 200/ 469 , Loss 1.2729737758636475
Epoch 1 / 6, Step 300/ 469 , Loss 1.3090766668319702
Epoch 1 / 6, Step 400/ 469 , Loss 1.3136640787124634
Epoch 2 / 6, Step 100/ 469 , Loss 1.2323434352874756
Epoch 2 / 6, Step 200/ 469 , Loss 1.279140591621399
Epoch 2 / 6, Step 300/ 469 , Loss 1.097141146659851
Epoch 2 / 6, Step 400/ 469 , Loss 1.091235876083374
Epoch 3 / 6, Step 100/ 469 , Loss 1.0787136554718018
Epoch 3 / 6, Step 200/ 469 , Loss 1.0675567388534546
Epoch 3 / 6, Step 300/ 469 , Loss 1.2230411767959595
Epoch 3 / 6, Step 400/ 469 , Loss 1.1279586553573608
Epoch 4 / 6, Step 100/ 469 , Loss 1.0751893520355225
Epoch 4 / 6, Step 200/ 469 , Loss 1.2589514255523682
Epoch 4 / 6, Step 300/ 469 , Loss 1.1153379678726196
Epoch 4 / 6, Step 400/ 469 , Loss 1.1287935972213745
Epoch 5 / 6, Step 100/ 469 , Loss 1.0029702186584473
Epoch 5 / 6, Step 200/ 469 , Loss 1.3196955919265747
Epoch 5 / 6, Step 300/ 469 , Loss 1.1173112392425

In [17]:
retrained_net_kmeans_50_label_50_not = retrain_CNN_with_predicted_labels(trainset, labeled_train_data_50, unlabeled_train_data_50, unlabeled_train_loader_50, kmeans)

Epoch 1 / 6, Step 100/ 469 , Loss 1.4070624113082886
Epoch 1 / 6, Step 200/ 469 , Loss 1.1350146532058716
Epoch 1 / 6, Step 300/ 469 , Loss 1.2456623315811157
Epoch 1 / 6, Step 400/ 469 , Loss 1.3466253280639648
Epoch 2 / 6, Step 100/ 469 , Loss 1.0949289798736572
Epoch 2 / 6, Step 200/ 469 , Loss 1.088902235031128
Epoch 2 / 6, Step 300/ 469 , Loss 1.0283422470092773
Epoch 2 / 6, Step 400/ 469 , Loss 1.1369366645812988
Epoch 3 / 6, Step 100/ 469 , Loss 0.8658769726753235
Epoch 3 / 6, Step 200/ 469 , Loss 1.1358987092971802
Epoch 3 / 6, Step 300/ 469 , Loss 1.1506801843643188
Epoch 3 / 6, Step 400/ 469 , Loss 0.9587178826332092
Epoch 4 / 6, Step 100/ 469 , Loss 0.9824383854866028
Epoch 4 / 6, Step 200/ 469 , Loss 1.0999360084533691
Epoch 4 / 6, Step 300/ 469 , Loss 1.08925199508667
Epoch 4 / 6, Step 400/ 469 , Loss 1.1529611349105835
Epoch 5 / 6, Step 100/ 469 , Loss 1.1042420864105225
Epoch 5 / 6, Step 200/ 469 , Loss 1.1022045612335205
Epoch 5 / 6, Step 300/ 469 , Loss 0.9485915899276

In [18]:
retrained_net_kmeans_30_label_70_not = retrain_CNN_with_predicted_labels(trainset, labeled_train_data_30, unlabeled_train_data_70, unlabeled_train_loader_70, kmeans)

Epoch 1 / 6, Step 100/ 469 , Loss 0.5977845788002014
Epoch 1 / 6, Step 200/ 469 , Loss 0.6786027550697327
Epoch 1 / 6, Step 300/ 469 , Loss 0.49654868245124817
Epoch 1 / 6, Step 400/ 469 , Loss 0.5311013460159302
Epoch 2 / 6, Step 100/ 469 , Loss 0.6410374045372009
Epoch 2 / 6, Step 200/ 469 , Loss 0.4915721118450165
Epoch 2 / 6, Step 300/ 469 , Loss 0.5669322609901428
Epoch 2 / 6, Step 400/ 469 , Loss 0.5070799589157104
Epoch 3 / 6, Step 100/ 469 , Loss 0.39156338572502136
Epoch 3 / 6, Step 200/ 469 , Loss 0.5083044767379761
Epoch 3 / 6, Step 300/ 469 , Loss 0.470933198928833
Epoch 3 / 6, Step 400/ 469 , Loss 0.36339080333709717
Epoch 4 / 6, Step 100/ 469 , Loss 0.44836822152137756
Epoch 4 / 6, Step 200/ 469 , Loss 0.4707953929901123
Epoch 4 / 6, Step 300/ 469 , Loss 0.49060937762260437
Epoch 4 / 6, Step 400/ 469 , Loss 0.4809212386608124
Epoch 5 / 6, Step 100/ 469 , Loss 0.47335031628608704
Epoch 5 / 6, Step 200/ 469 , Loss 0.4103602170944214
Epoch 5 / 6, Step 300/ 469 , Loss 0.51628

In [19]:
retrained_net_gmm_70_label_30_not = retrain_CNN_with_predicted_labels(trainset, labeled_train_data_70, unlabeled_train_data_30, unlabeled_train_loader_30, gmm)

Epoch 1 / 6, Step 100/ 469 , Loss 0.474614679813385
Epoch 1 / 6, Step 200/ 469 , Loss 0.4131590723991394
Epoch 1 / 6, Step 300/ 469 , Loss 0.3221578299999237
Epoch 1 / 6, Step 400/ 469 , Loss 0.47874459624290466
Epoch 2 / 6, Step 100/ 469 , Loss 0.30274778604507446
Epoch 2 / 6, Step 200/ 469 , Loss 0.38786476850509644
Epoch 2 / 6, Step 300/ 469 , Loss 0.3099869191646576
Epoch 2 / 6, Step 400/ 469 , Loss 0.38634538650512695
Epoch 3 / 6, Step 100/ 469 , Loss 0.4012346863746643
Epoch 3 / 6, Step 200/ 469 , Loss 0.3499428927898407
Epoch 3 / 6, Step 300/ 469 , Loss 0.32617661356925964
Epoch 3 / 6, Step 400/ 469 , Loss 0.49075034260749817
Epoch 4 / 6, Step 100/ 469 , Loss 0.3801426589488983
Epoch 4 / 6, Step 200/ 469 , Loss 0.3631058931350708
Epoch 4 / 6, Step 300/ 469 , Loss 0.3931160271167755
Epoch 4 / 6, Step 400/ 469 , Loss 0.392886221408844
Epoch 5 / 6, Step 100/ 469 , Loss 0.4301944971084595
Epoch 5 / 6, Step 200/ 469 , Loss 0.3360289931297302
Epoch 5 / 6, Step 300/ 469 , Loss 0.278450

In [20]:
retrained_net_gmm_50_label_50_not = retrain_CNN_with_predicted_labels(trainset, labeled_train_data_50, unlabeled_train_data_50, unlabeled_train_loader_50, gmm)

Epoch 1 / 6, Step 100/ 469 , Loss 0.3827483654022217
Epoch 1 / 6, Step 200/ 469 , Loss 0.5807194709777832
Epoch 1 / 6, Step 300/ 469 , Loss 0.3654506206512451
Epoch 1 / 6, Step 400/ 469 , Loss 0.36789625883102417
Epoch 2 / 6, Step 100/ 469 , Loss 0.5412696599960327
Epoch 2 / 6, Step 200/ 469 , Loss 0.5430489778518677
Epoch 2 / 6, Step 300/ 469 , Loss 0.4248042106628418
Epoch 2 / 6, Step 400/ 469 , Loss 0.2830982506275177
Epoch 3 / 6, Step 100/ 469 , Loss 0.3062753975391388
Epoch 3 / 6, Step 200/ 469 , Loss 0.3688565194606781
Epoch 3 / 6, Step 300/ 469 , Loss 0.30839142203330994
Epoch 3 / 6, Step 400/ 469 , Loss 0.4163661301136017
Epoch 4 / 6, Step 100/ 469 , Loss 0.3262276351451874
Epoch 4 / 6, Step 200/ 469 , Loss 0.23639681935310364
Epoch 4 / 6, Step 300/ 469 , Loss 0.37633031606674194
Epoch 4 / 6, Step 400/ 469 , Loss 0.248992919921875
Epoch 5 / 6, Step 100/ 469 , Loss 0.2677233815193176
Epoch 5 / 6, Step 200/ 469 , Loss 0.31718528270721436
Epoch 5 / 6, Step 300/ 469 , Loss 0.383190

In [21]:
retrained_net_gmm_30_label_70_not = retrain_CNN_with_predicted_labels(trainset, labeled_train_data_30, unlabeled_train_data_70, unlabeled_train_loader_70, gmm)

Epoch 1 / 6, Step 100/ 469 , Loss 0.3068472445011139
Epoch 1 / 6, Step 200/ 469 , Loss 0.584756076335907
Epoch 1 / 6, Step 300/ 469 , Loss 0.42139360308647156
Epoch 1 / 6, Step 400/ 469 , Loss 0.4365386962890625
Epoch 2 / 6, Step 100/ 469 , Loss 0.4774197041988373
Epoch 2 / 6, Step 200/ 469 , Loss 0.5145040154457092
Epoch 2 / 6, Step 300/ 469 , Loss 0.37255585193634033
Epoch 2 / 6, Step 400/ 469 , Loss 0.4935983121395111
Epoch 3 / 6, Step 100/ 469 , Loss 0.2720433473587036
Epoch 3 / 6, Step 200/ 469 , Loss 0.4063916802406311
Epoch 3 / 6, Step 300/ 469 , Loss 0.3594999313354492
Epoch 3 / 6, Step 400/ 469 , Loss 0.2587115466594696
Epoch 4 / 6, Step 100/ 469 , Loss 0.5196161866188049
Epoch 4 / 6, Step 200/ 469 , Loss 0.40496110916137695
Epoch 4 / 6, Step 300/ 469 , Loss 0.36468905210494995
Epoch 4 / 6, Step 400/ 469 , Loss 0.38384923338890076
Epoch 5 / 6, Step 100/ 469 , Loss 0.36963167786598206
Epoch 5 / 6, Step 200/ 469 , Loss 0.31178006529808044
Epoch 5 / 6, Step 300/ 469 , Loss 0.3828

#**Results**


###First we will look at the **baseline** results from the CNN trained on M labeled images

In [22]:
testFeatureEmbed, _ = getFeatureEmbedings(testloader)

In [23]:
baseline_accuracy, loss = test(testloader, net)
print(f"Our accuracy with just the CNN is: {baseline_accuracy} on test set from training on trainset" )

Our accuracy with just the CNN is: 0.8463 on test set from training on trainset


### Next we will look at all the other methods that use weakly supervised learning, we aim to be better than the baseline in each


In [34]:
accuracy_kmeans_70_label_30_not, loss = test(testloader, retrained_net_kmeans_70_label_30_not)
print(f"Our accuracy with the CNN and Mini-Batch Kmeans clustering with 70% labeled, 30% unlabeled is: {accuracy_kmeans_70_label_30_not} on test set" )

Our accuracy with the CNN and Mini-Batch Kmeans clustering with 70% labeled, 30% unlabeled is: 0.8666 on test set


In [35]:
accuracy_kmeans_50_label_50_not, loss = test(testloader, retrained_net_kmeans_50_label_50_not)
print(f"Our accuracy with the CNN and Mini-Batch Kmeans clustering with 50% labeled, 50% unlabeled is: {accuracy_kmeans_50_label_50_not} on test set" )

Our accuracy with the CNN and Mini-Batch Kmeans clustering with 50% labeled, 50% unlabeled is: 0.1824 on test set


In [36]:
accuracy_kmeans_30_label_70_not, loss = test(testloader, retrained_net_kmeans_30_label_70_not)
print(f"Our accuracy with the CNN and Mini-Batch Kmeans clustering with 30% labeled, 70% unlabeled is: {accuracy_kmeans_30_label_70_not} on test set" )

Our accuracy with the CNN and Mini-Batch Kmeans clustering with 30% labeled, 70% unlabeled is: 0.1001 on test set


In [37]:
accuracy_gmm_70_label_30_not, loss = test(testloader, retrained_net_gmm_70_label_30_not)
print(f"Our accuracy with the CNN and Kmeans clustering with 70% labeled, 30% unlabeled is: {accuracy_gmm_70_label_30_not} on test set from training on trainset" )

Our accuracy with the CNN and Kmeans clustering with 70% labeled, 30% unlabeled is: 0.1002 on test set from training on trainset


In [38]:
accuracy_gmm_50_label_50_not, loss = test(testloader, retrained_net_gmm_50_label_50_not)
print(f"Our accuracy with the CNN and Kmeans clustering with 50% labeled, 50% unlabeled is: {accuracy_gmm_50_label_50_not} on test set from training on trainset" )

Our accuracy with the CNN and Kmeans clustering with 50% labeled, 50% unlabeled is: 0.1 on test set from training on trainset


In [39]:
accuracy_gmm_30_label_70_not, loss = test(testloader, retrained_net_gmm_30_label_70_not)
print(f"Our accuracy with the CNN and Kmeans clustering with 30% labeled, 70% unlabeled is: {accuracy_gmm_30_label_70_not} on test set from training on trainset" )

Our accuracy with the CNN and Kmeans clustering with 30% labeled, 70% unlabeled is: 0.1 on test set from training on trainset


#**Conclusions**

One positive result from our experiments is that we found Mini-Batch Kmeans with a ratio of 70% labeled, 30% unlabeled yielded a 2% increase in accuracy compared to the baseline. However we saw that all other ratios and clustering methods provided a < 20% accuracy rate. Mini Batch K-means in general was more accurate than just pure k-means.


The usefulness of ratios where unlabeled was equal to or greater that would have been more of a useful result, but an improvement from 30% unlabeled to 
86.6% accuracy is not a total loss.


We were limited in which clustering methods we could use due to RAM capacity and learning times. GMM was used originally but it would take 2 hours to train with a GPU, and would often crash. DBSCAN, and OPTICS models were also attempted but faced similar issues. 


One reason we believe that the results weren't as strong as we expected is that we randomly split the data into M% labeled and (N-M)% unlabeled, without taking into consideration the labels of the data. This could have lead to hotspots of certain label types being placed in the unlabeled set, leading to poor classification results.


Another reason we believe that could explain the result, is that we performed clustering on the feature emebedings of the images. Meaning that after we trained our original CNN on M labeled images, we used it to get a vector of size 4096 from each image in the training set. We performed clustering on the feature embeded version of each image, and 4096 might have been too small of a feature to encode all the data of an image required to cluster it properly (although in prior tests we reached 86% accuracy just from clustering using a full labeled N size dataset).

Lastly we believe that the way that we assigned labels to the unlabeled images might be improved. We used the mode label in each cluster (from the labeled images in the cluster). We believe that in the future we should have only labeled images with a probability greater than a certain threshold as the label, as there were 30 clusters, and noisy/inacurate labels could have been introduced.