<a href="https://colab.research.google.com/github/HardikPaliwal/CS484Proj/blob/master/cs484%20proj.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **CS484 Final Project**

#### Topic 6: Weakly supervised classification

#### Hardik Paliwal (20725413), Lance Pereira (20719626)

______________________________________________________________

#**Table Of Contents**
- A) Abstract
- B) High Level Goals and Methodology
- C) Team Members and Contributions
- D) External Code Libraries
- E) Code
  - 1) Setup (imports, loading data)
  - 2) Define our Base CNN (Based on VGG11), Test, Train methods
  - 3) Split Fashion MNIST training data into M labeled images, and N-M unlabeled
  - 4) Train Base CNN on M labeled training images
  - 5) Use the 'semi' trained CNN to gain important features (feature embeddings) of N-M *unlabeled* training images
  - 6) Match cluster labels to actual Fashion MNIST labels
  - 7) Define retrained CNN function using predicted labels from clustering for unlabeled data
- F) Experiments
  - 1) Different ratios of M/N
  - 2) Different Clustering Methods
- G) Results
- H) Conclusions

#**A) Abstract:**

For our project we chose Topic 6, working specifically with Fashion MNIST. We apply weakly supervised classification: feature extraction (embeddings, similar in spirit to a PCA projection) combined with clustering methods, to try to improve on the results of supervised learning alone.
 

#**B) High Level Goals and Methodology:**

Our high-level goal is to use weakly supervised classification to improve our prediction ability compared to training on labeled images alone. We hope to achieve at least a 5% increase in our test prediction score using clustering methods such as K-means and Mini-Batch K-means. Our attempts to use other clustering methods were stopped by our limited RAM capacity, but K-means and Mini-Batch K-means are quite effective.

Our method is split into two experiments:
- First, we train a CNN on the M labeled images, then use a clustering method to group all N images, using the majority label in each cluster as the predicted label for the N-M unlabeled images.
- Second, we use the predicted labels from the previous step to train a new CNN model, to see whether it performs better than the original CNN model trained on just the M labeled images.

Our baseline is a simple CNN trained on the M labeled images. We use it to see whether our weakly supervised method improves the accuracy rate.

#**C) Team Members and Contributions**

- Hardik Paliwal
  - Created CNN based on VGG11 for training
  - Created function to gather features (feature embeddings) from pretrained CNN (trained on M labeled images)
  - Created method to get predicted labels from clustering methods
  - Created train and test methods

- Lance Pereira
  - Created function to retrain CNN using clustered labels
  - Modified clustering prediction methods to use only labeled training data
  - Created experiments for different ratios of M:N-M
  - Created function to split data into labeled and unlabeled
  - Created experiments for different clustering methods
    - Mini-Batch K-means
    - K-means
  - Wrote/formatted report



#**D) External Code Libraries**

We used

- PyTorch
  - Crucial for training our models quickly
  - Saved us from having to compute backpropagation ourselves
- scikit-learn
  - Provided a large range of clustering methods for quick experimentation
- NumPy
  - Useful for large matrix operations

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision as tv
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader
from torch.autograd import Variable
import sklearn
from torch import optim
import numpy as np
import itertools

%matplotlib inline

# **E) Code**

### **E) [1] Code: Setup**

The cells below:
- Define the constants used throughout the notebook
- Import Fashion MNIST and build the train and test loaders

In [None]:
# Constants
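# Use the GPU when one is available, otherwise fall back to the CPU
dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")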
NUM_EPOCHS = 6
NUM_CLUSTERS = 30
NUM_CLASSES = 10
UNKNOWN_CLASS = 11

In [None]:
trainset = tv.datasets.FashionMNIST(root="./", download=True,train=True,  transform=tv.transforms.Compose(
    [tv.transforms.Resize(32), tv.transforms.ToTensor()]))
trainloader = DataLoader(trainset, batch_size=128, shuffle=True)

testset = tv.datasets.FashionMNIST(root="./", download=True,train=False,  transform=tv.transforms.Compose(
    [tv.transforms.Resize(32), tv.transforms.ToTensor()]))
testloader = DataLoader(testset, batch_size=128, shuffle=True)

### **E) [2] Code: Define our Base CNN (Based on VGG11), Test, Train methods**

In [None]:
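# An implementation of VGG11 (a shallower configuration from the same family as VGG16) adapted to single-channel Fashion MNIST images.
# It also takes n and creates n+1 outputs: the original intent was n classes plus an extra "unknown" class,
# which would let us differentiate the unlabeled data from the labeled data.
# With the default n=9, the 10 outputs correspond to the 10 Fashion MNIST classes.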
class BasicNet(nn.Module):
    def __init__(self, n=9):
        super(BasicNet, self).__init__()
        self.batchNorm = [nn.BatchNorm2d(64), nn.BatchNorm2d(128),nn.BatchNorm2d(256), nn.BatchNorm2d(256),
                          nn.BatchNorm2d(512), nn.BatchNorm2d(512), nn.BatchNorm2d(512), nn.BatchNorm2d(512)]
        self.conv = [
        nn.Conv2d(1, 64, 3, 1, 1) ,nn.Conv2d(64, 128, 3, 1, 1), nn.Conv2d(128, 256, 3, 1, 1), nn.Conv2d(256, 256, 3, 1, 1)
       ,nn.Conv2d(256, 512, 3, 1, 1), nn.Conv2d(512, 512, 3, 1, 1), nn.Conv2d(512, 512, 3, 1, 1), nn.Conv2d(512, 512, 3, 1, 1)
        ]
        maxPool = nn.MaxPool2d(2, stride=2)
        self.conv1 = nn.Sequential(self.conv[0], self.batchNorm[0], nn.ReLU(), maxPool)
        self.conv2 = nn.Sequential(self.conv[1], self.batchNorm[1], nn.ReLU(), maxPool)
        self.conv3 = nn.Sequential(self.conv[2], self.batchNorm[2], nn.ReLU())
        self.conv4 = nn.Sequential(self.conv[3], self.batchNorm[3], nn.ReLU(), maxPool) 
        self.conv5 = nn.Sequential(self.conv[4], self.batchNorm[4], nn.ReLU())
        self.conv6 = nn.Sequential(self.conv[5], self.batchNorm[5], nn.ReLU(), maxPool)
        self.conv7 = nn.Sequential(self.conv[6], self.batchNorm[6], nn.ReLU()) 
        self.conv8 = nn.Sequential(self.conv[7], self.batchNorm[7], nn.ReLU(), maxPool)
        self.fc1 = nn.Linear(512, 4096)
        self.fc2 = nn.Linear(4096, 4096)
        self.fc3 = nn.Linear(4096, n+1)
        
    def forward(self, x, feature_embedding=False):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        x = self.conv5(x)
        x = self.conv6(x)
        x = self.conv7(x)
        x = self.conv8(x)
        x = torch.flatten(x, 1)

        # Tie dropout to self.training so that net.eval() switches it off;
        # an nn.Dropout created inside forward() would always stay in training mode
        x = F.dropout(F.relu(self.fc1(x)), p=0.5, training=self.training)
        x = F.dropout(F.relu(self.fc2(x)), p=0.5, training=self.training)
        if feature_embedding:
            # The 4096-dimensional activations of fc2 serve as the feature embedding
            return x
        # Return raw logits: nn.CrossEntropyLoss applies log-softmax internally, so no softmax layer is needed here
        x = self.fc3(x)
        return x
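
The input size of `fc1` (512) follows from the layer arithmetic. The sketch below is a sanity check in plain Python, not part of the notebook's pipeline; the helper names `conv_out`, `pool_out` and `final_size` are illustrative. It applies the standard output-size formula, floor((size + 2*padding - kernel) / stride) + 1, to the eight 3x3 convolutions and five 2x2 max-pools defined above.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    # Standard output-size formula for a convolution along one spatial axis
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    # Same formula without padding, matching nn.MaxPool2d(2, stride=2)
    return (size - kernel) // stride + 1

# True where the corresponding conv block in BasicNet ends with a max-pool
POOL_AFTER = [True, True, False, True, False, True, False, True]

def final_size(size):
    # Trace one spatial dimension through all eight conv blocks
    for has_pool in POOL_AFTER:
        size = conv_out(size)
        if has_pool:
            size = pool_out(size)
    return size

# 512 channels leave conv8, so the flattened length is 512 * height * width
print(final_size(32), 512 * final_size(32) ** 2)  # 1 512
print(final_size(28))  # 0
```

A 32x32 input shrinks to 1x1 after the fifth pool, so the flattened vector has exactly 512 entries. The same trace starting from the native 28x28 size ends at 0, which is presumably why the loaders apply `Resize(32)`.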

In [None]:
def test(data, net):
    # Evaluates net on a DataLoader and returns (accuracy, loss summed over all samples)
    net.eval()
    loss_func = nn.CrossEntropyLoss()

    total_correct = 0
    total_loss = 0
    with torch.no_grad():
        for images, labels in data:
            images = images.to(dev)
            labels = labels.to(dev)
            test_pred = net(images)

            # Predicted class is the index of the largest logit
            pred = test_pred.argmax(dim=1)
            total_correct += (pred == labels).sum().item()
            # loss_func returns the batch mean, so scale by the batch size to accumulate a per-sample sum
            loss = loss_func(test_pred, labels)
            total_loss += loss.item() * images.size(0)
    # Note from Lance: we return the total loss instead of dividing it by len(data.dataset)
    return total_correct / len(data.dataset), total_loss
  
def train(num_epochs, net, trainloader):
    optimizer = optim.SGD(net.parameters(), lr=0.01)
    loss_func = nn.CrossEntropyLoss()

    accuracy_through_epochs = []
    total_step = len(trainloader)
    
    for epoch in range(num_epochs):
        net.train()
        for i, (images, labels) in enumerate(trainloader):
            images= images.to(dev)
            labels = labels.to(dev)
            optimizer.zero_grad()           
            prediction = net(images)
            loss = loss_func( prediction, labels)
            loss.backward()
            optimizer.step()
            if ((i +1) % 100 == 0):
                print(f"Epoch {epoch+1} / {num_epochs}, Step {i+1}/ {total_step} , Loss {loss.item()}")

    return accuracy_through_epochs, net
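
Both functions pass the raw outputs of `fc3` to `nn.CrossEntropyLoss`, which applies log-softmax internally, so the network needs no softmax layer of its own. The plain-Python sketch below, using made-up logits, shows the quantity computed for a single sample. `cross_entropy_from_logits` and `softmax` are illustrative helpers, not PyTorch functions.

```python
import math

def cross_entropy_from_logits(logits, target):
    # Numerically stable log-sum-exp: subtract the max before exponentiating
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # Negative log-softmax of the target class
    return log_sum_exp - logits[target]

def softmax(logits):
    # Turn unnormalized scores into probabilities that sum to 1
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, -1.0, 0.5]  # hypothetical unnormalized scores for 3 classes
target = 0
loss = cross_entropy_from_logits(logits, target)
probs = softmax(logits)

# Both expressions give the same loss value
print(loss, -math.log(probs[target]))
```

Because softmax preserves the ordering of its inputs, taking the argmax of the logits gives the same predicted class as taking the argmax of the probabilities, which is why `test` can work on the raw outputs.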

### **E) [3] Code: Split Fashion MNIST training data into M labeled images, and N-M unlabeled**

In [None]:
# Randomly splits training_data into M labeled images and N-M images that are treated as unlabeled
def splitTrainingData(training_data, M_percent):
  # N is the total number of training images
  len_N = len(training_data)

  # M is the number of labeled images we want
  len_M = int(M_percent*len_N)
  
  labeled_data, unlabeled_data = torch.utils.data.random_split(training_data, [len_M, len_N - len_M])

  # The labels of unlabeled_data are left in place here; they are hidden later, when cluster labels are assigned

  labeled_data_loader = DataLoader(labeled_data, batch_size=128, shuffle=True)
  # No shuffling here, so embeddings computed from this loader line up with unlabeled_data.indices
  unlabeled_data_loader = DataLoader(unlabeled_data, batch_size=128, shuffle=False)

  return labeled_data_loader, unlabeled_data_loader, labeled_data, unlabeled_data


In [None]:
labeled_train_loader, unlabeled_train_loader, labeled_train_data, unlabeled_train_data = splitTrainingData(trainset, 0.7)
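
As a quick check on the split sizes, the arithmetic below reproduces `len_M = int(M_percent*len_N)` for the three ratios used in this notebook and derives the number of batches per epoch. It is a plain-Python sketch; `split_sizes` and `steps_per_epoch` are illustrative names, and it assumes the standard 60,000-image Fashion MNIST training set with the default `DataLoader` behaviour of keeping the last partial batch.

```python
import math

N = 60000         # size of the Fashion MNIST training set
BATCH_SIZE = 128  # batch size used by every DataLoader in this notebook

def split_sizes(n, m_fraction):
    # Mirrors len_M = int(M_percent * len_N) from splitTrainingData
    len_m = int(m_fraction * n)
    return len_m, n - len_m

def steps_per_epoch(num_images, batch_size=BATCH_SIZE):
    # The final partial batch is kept by default, hence the ceiling
    return math.ceil(num_images / batch_size)

for fraction in (0.7, 0.5, 0.3):
    labeled, unlabeled = split_sizes(N, fraction)
    print(fraction, labeled, unlabeled, steps_per_epoch(labeled))

print(steps_per_epoch(N))
```

The 70% split gives 42,000 labeled images and 329 steps per epoch, which matches the step count in the baseline training log. Retraining on all 60,000 images gives 469 steps, as seen in the retraining logs.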

### **E) [4] Code: Train Base CNN on M labeled training images**

In [None]:
net = BasicNet()
net.to(dev)
result, trained_net = train(NUM_EPOCHS, net, labeled_train_loader)

Epoch 1 / 6, Step 100/ 329 , Loss 0.4691801965236664
Epoch 1 / 6, Step 200/ 329 , Loss 0.4827142059803009
Epoch 1 / 6, Step 300/ 329 , Loss 0.38404151797294617
Epoch 2 / 6, Step 100/ 329 , Loss 0.24493633210659027
Epoch 2 / 6, Step 200/ 329 , Loss 0.2605207860469818
Epoch 2 / 6, Step 300/ 329 , Loss 0.2828778028488159
Epoch 3 / 6, Step 100/ 329 , Loss 0.2332649976015091
Epoch 3 / 6, Step 200/ 329 , Loss 0.2661450207233429
Epoch 3 / 6, Step 300/ 329 , Loss 0.26624417304992676
Epoch 4 / 6, Step 100/ 329 , Loss 0.1302751898765564
Epoch 4 / 6, Step 200/ 329 , Loss 0.196417436003685
Epoch 4 / 6, Step 300/ 329 , Loss 0.23519733548164368
Epoch 5 / 6, Step 100/ 329 , Loss 0.24351318180561066
Epoch 5 / 6, Step 200/ 329 , Loss 0.07862726598978043
Epoch 5 / 6, Step 300/ 329 , Loss 0.20430561900138855
Epoch 6 / 6, Step 100/ 329 , Loss 0.10598327964544296
Epoch 6 / 6, Step 200/ 329 , Loss 0.14645934104919434
Epoch 6 / 6, Step 300/ 329 , Loss 0.09488426148891449


### **E) [5] Code: Use the 'semi' trained CNN to gain important features (feature embeddings) of N-M *unlabeled* training images**

In [None]:
# Store the results in one NumPy array per output instead of a list of 128-image batches.
# Running the whole dataset through the network as a single batch causes a memory error,
# so we process it batch by batch and combine the results.

def getFeatureEmbedings(dataloader):
  featureEmbed = []
  predictedDigit = []

  # Uses the global baseline net; eval mode keeps BatchNorm from using per-batch statistics
  net.eval()
  with torch.no_grad():
    for images, _ in dataloader:
        images = images.to(dev)
        featureEmbed.append(net(images, feature_embedding=True).cpu().numpy())
        predictedDigit.append(net(images).argmax(dim=1).cpu().numpy())

  # Concatenate the per-batch arrays along the sample axis
  featureEmbed = np.concatenate(featureEmbed)
  predictedDigit = np.concatenate(predictedDigit)

  return featureEmbed, predictedDigit

In [None]:
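# We need the embeddings in dataset order so that row i lines up with trainset.targets[i].
# trainloader shuffles, so a separate non-shuffling loader is used here.
ordered_trainloader = DataLoader(trainset, batch_size=128, shuffle=False)
trainFeatureEmbed, trainPredictedDigit = getFeatureEmbedings(ordered_trainloader)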

### **E) [6] Code: Match cluster labels to actual Fashion MNIST labels**

In [None]:
# To see how well k-means did, we use a supervised way of naming each cluster: the cluster label is the most common known class in that cluster.
# Unsupervised alternatives include manually selecting a class based on the mean image.
# cluster_labels[i] and y_train[i] must refer to the same image.
def retrieve_cluster_to_classification(cluster_labels, y_train):
  reference_labels = {}
  # Loop over every cluster id that actually occurs
  for i in np.unique(cluster_labels):
    mask = cluster_labels == i
    # We only read 0:NUM_CLASSES so we don't count the unknown labels
    num = np.bincount(y_train[mask])[:NUM_CLASSES].argmax()
    reference_labels[int(i)] = int(num)
    # TODO: Right now the reference label just maps to the majority label; should we also consider the 2nd and 3rd highest?
  return reference_labels
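
The majority-vote idea can be illustrated on toy data without NumPy. This is a minimal sketch rather than the function used in the notebook: `majority_label_per_cluster` and the `UNKNOWN` placeholder are illustrative, with `UNKNOWN = 10` mirroring how hidden labels are set to `NUM_CLASSES`.

```python
from collections import Counter

UNKNOWN = 10  # placeholder for hidden labels, mirroring NUM_CLASSES in the notebook

def majority_label_per_cluster(cluster_ids, labels, unknown=UNKNOWN):
    # Collect the known labels that fall into each cluster
    votes = {}
    for cluster, label in zip(cluster_ids, labels):
        if label != unknown:
            votes.setdefault(cluster, Counter())[label] += 1
    # Map each cluster to its most common known label
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}

# Toy data: six images in two clusters, two of them unlabeled
cluster_ids = [0, 0, 0, 1, 1, 1]
labels = [3, 3, UNKNOWN, 7, UNKNOWN, 7]
mapping = majority_label_per_cluster(cluster_ids, labels)
print(mapping)  # {0: 3, 1: 7}
```

The vote only makes sense when position `i` of both sequences refers to the same image, so the cluster assignments have to be computed in the same order as the label array. One difference from the `np.bincount` version: here a cluster with no labeled members is simply absent from the mapping, whereas `bincount(...)[:NUM_CLASSES].argmax()` falls back to class 0.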

### **E) [7] Code: Define retrained CNN function using predicted labels from clustering for unlabeled data**



In [None]:
import copy

def retrain_CNN_with_predicted_labels(train_dataset, my_labeled_trainset, my_unlabeled_trainset, unlabeled_train_loader, cluster_method):

  # Work on a copy so the pseudo-labels never overwrite the true targets shared by the other experiments
  pseudo_dataset = copy.deepcopy(train_dataset)

  # Hide the true labels of the unlabeled images by setting them to NUM_CLASSES
  train_targets = np.array(train_dataset.targets.numpy(), copy=True)
  train_targets[my_unlabeled_trainset.indices] = NUM_CLASSES

  # Mapping from cluster id (0..NUM_CLUSTERS-1) to Fashion MNIST class
  reference_labels = retrieve_cluster_to_classification(
      cluster_method.labels_,
      train_targets
  )

  # Row i of the embeddings must belong to my_unlabeled_trainset.indices[i],
  # so unlabeled_train_loader has to iterate without shuffling
  unlabelTrainFeatureEmbed, _ = getFeatureEmbedings(unlabeled_train_loader)

  # Assign each unlabeled image the class of its nearest cluster
  predicted_clusters = cluster_method.predict(unlabelTrainFeatureEmbed)
  for i, cluster in enumerate(predicted_clusters):
    pseudo_dataset.targets[my_unlabeled_trainset.indices[i]] = reference_labels[int(cluster)]

  # Labeled images keep their true labels; unlabeled images carry the pseudo-labels
  new_combined_data = torch.utils.data.ConcatDataset([
      torch.utils.data.Subset(pseudo_dataset, my_labeled_trainset.indices),
      torch.utils.data.Subset(pseudo_dataset, my_unlabeled_trainset.indices)])
  new_combined_data_loader = DataLoader(new_combined_data, batch_size=128, shuffle=True)
  retrained_net = BasicNet()
  retrained_net.to(dev)
  result, retrained_net = train(NUM_EPOCHS, retrained_net, new_combined_data_loader)
  return retrained_net



#**Experiments**


###Experiment A) First we will create three experiments using different ratios of M:N-M


1.   Ratio of 70% labeled to 30% unlabeled
2.   Ratio of 50% labeled to 50% unlabeled
3.   Ratio of 30% labeled to 70% unlabeled




In [None]:
# Experiment A: Different ratios of M to N

labeled_train_loader_70, unlabeled_train_loader_30, labeled_train_data_70, unlabeled_train_data_30 = splitTrainingData(trainset, 0.7)
labeled_train_loader_50, unlabeled_train_loader_50, labeled_train_data_50, unlabeled_train_data_50 = splitTrainingData(trainset, 0.5)
labeled_train_loader_30, unlabeled_train_loader_70, labeled_train_data_30, unlabeled_train_data_70 = splitTrainingData(trainset, 0.3)

###Experiment B) Then we will try training it with different clustering methods

In [None]:
from sklearn.cluster import MiniBatchKMeans

kmeans = MiniBatchKMeans(n_clusters = NUM_CLUSTERS)

kmeans.fit(trainFeatureEmbed)

MiniBatchKMeans(n_clusters=30)

In [None]:
from sklearn.cluster import KMeans

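# The name gmm is a leftover from an earlier attempt with a Gaussian mixture model; this is a plain KMeans model
gmm = KMeans(n_clusters = NUM_CLUSTERS)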

gmm.fit(trainFeatureEmbed)

KMeans(n_clusters=30)

###Experiment C) Train all the models combining different experiments

In [None]:
retrained_net_kmeans_70_label_30_not = retrain_CNN_with_predicted_labels(trainset, labeled_train_data_70, unlabeled_train_data_30, unlabeled_train_loader_30, kmeans)

Epoch 1 / 6, Step 100/ 469 , Loss 1.448676586151123
Epoch 1 / 6, Step 200/ 469 , Loss 1.4181629419326782
Epoch 1 / 6, Step 300/ 469 , Loss 1.4387338161468506
Epoch 1 / 6, Step 400/ 469 , Loss 1.294232726097107
Epoch 2 / 6, Step 100/ 469 , Loss 1.1496649980545044
Epoch 2 / 6, Step 200/ 469 , Loss 1.2088148593902588
Epoch 2 / 6, Step 300/ 469 , Loss 0.9243248105049133
Epoch 2 / 6, Step 400/ 469 , Loss 1.2810783386230469
Epoch 3 / 6, Step 100/ 469 , Loss 1.2413318157196045
Epoch 3 / 6, Step 200/ 469 , Loss 1.2971490621566772
Epoch 3 / 6, Step 300/ 469 , Loss 1.2465242147445679
Epoch 3 / 6, Step 400/ 469 , Loss 1.2370727062225342
Epoch 4 / 6, Step 100/ 469 , Loss 0.8841609954833984
Epoch 4 / 6, Step 200/ 469 , Loss 0.988118588924408
Epoch 4 / 6, Step 300/ 469 , Loss 0.9165753722190857
Epoch 4 / 6, Step 400/ 469 , Loss 1.0912517309188843
Epoch 5 / 6, Step 100/ 469 , Loss 1.1586616039276123
Epoch 5 / 6, Step 200/ 469 , Loss 1.0676538944244385
Epoch 5 / 6, Step 300/ 469 , Loss 1.0175216197967

In [None]:
retrained_net_kmeans_50_label_50_not = retrain_CNN_with_predicted_labels(trainset, labeled_train_data_50, unlabeled_train_data_50, unlabeled_train_loader_50, kmeans)

Epoch 1 / 6, Step 100/ 469 , Loss 1.2840626239776611
Epoch 1 / 6, Step 200/ 469 , Loss 1.3736687898635864
Epoch 1 / 6, Step 300/ 469 , Loss 1.184698462486267
Epoch 1 / 6, Step 400/ 469 , Loss 1.0110880136489868
Epoch 2 / 6, Step 100/ 469 , Loss 0.8944224119186401
Epoch 2 / 6, Step 200/ 469 , Loss 1.1778855323791504
Epoch 2 / 6, Step 300/ 469 , Loss 0.9677593111991882
Epoch 2 / 6, Step 400/ 469 , Loss 1.1191428899765015
Epoch 3 / 6, Step 100/ 469 , Loss 1.121232032775879
Epoch 3 / 6, Step 200/ 469 , Loss 1.0038708448410034
Epoch 3 / 6, Step 300/ 469 , Loss 1.0726529359817505
Epoch 3 / 6, Step 400/ 469 , Loss 0.9662225842475891
Epoch 4 / 6, Step 100/ 469 , Loss 0.98355633020401
Epoch 4 / 6, Step 200/ 469 , Loss 1.0350593328475952
Epoch 4 / 6, Step 300/ 469 , Loss 0.9326592087745667
Epoch 4 / 6, Step 400/ 469 , Loss 0.9926052093505859
Epoch 5 / 6, Step 100/ 469 , Loss 1.0707937479019165
Epoch 5 / 6, Step 200/ 469 , Loss 1.0367767810821533
Epoch 5 / 6, Step 300/ 469 , Loss 0.89864164590835

In [None]:
retrained_net_kmeans_30_label_70_not = retrain_CNN_with_predicted_labels(trainset, labeled_train_data_30, unlabeled_train_data_70, unlabeled_train_loader_70, kmeans)

Epoch 1 / 6, Step 100/ 469 , Loss 0.649860680103302
Epoch 1 / 6, Step 200/ 469 , Loss 0.4560025930404663
Epoch 1 / 6, Step 300/ 469 , Loss 0.5309317111968994
Epoch 1 / 6, Step 400/ 469 , Loss 0.7394914627075195
Epoch 2 / 6, Step 100/ 469 , Loss 0.4349050521850586
Epoch 2 / 6, Step 200/ 469 , Loss 0.551249086856842
Epoch 2 / 6, Step 300/ 469 , Loss 0.6444259881973267
Epoch 2 / 6, Step 400/ 469 , Loss 0.39055222272872925
Epoch 3 / 6, Step 100/ 469 , Loss 0.43510597944259644
Epoch 3 / 6, Step 200/ 469 , Loss 0.3971916735172272
Epoch 3 / 6, Step 300/ 469 , Loss 0.5364770293235779
Epoch 3 / 6, Step 400/ 469 , Loss 0.4727073013782501
Epoch 4 / 6, Step 100/ 469 , Loss 0.49300462007522583
Epoch 4 / 6, Step 200/ 469 , Loss 0.359038770198822
Epoch 4 / 6, Step 300/ 469 , Loss 0.4499584436416626
Epoch 4 / 6, Step 400/ 469 , Loss 0.4876686632633209
Epoch 5 / 6, Step 100/ 469 , Loss 0.4338257312774658
Epoch 5 / 6, Step 200/ 469 , Loss 0.6371221542358398
Epoch 5 / 6, Step 300/ 469 , Loss 0.5640937685

In [None]:
retrained_net_gmm_70_label_30_not = retrain_CNN_with_predicted_labels(trainset, labeled_train_data_70, unlabeled_train_data_30, unlabeled_train_loader_30, gmm)

Epoch 1 / 6, Step 100/ 469 , Loss 0.41591203212738037
Epoch 1 / 6, Step 200/ 469 , Loss 0.5276227593421936
Epoch 1 / 6, Step 300/ 469 , Loss 0.4982303977012634
Epoch 1 / 6, Step 400/ 469 , Loss 0.42930495738983154
Epoch 2 / 6, Step 100/ 469 , Loss 0.34958839416503906
Epoch 2 / 6, Step 200/ 469 , Loss 0.422457218170166
Epoch 2 / 6, Step 300/ 469 , Loss 0.37986811995506287
Epoch 2 / 6, Step 400/ 469 , Loss 0.28429117798805237
Epoch 3 / 6, Step 100/ 469 , Loss 0.4221511483192444
Epoch 3 / 6, Step 200/ 469 , Loss 0.30540093779563904
Epoch 3 / 6, Step 300/ 469 , Loss 0.2792526185512543
Epoch 3 / 6, Step 400/ 469 , Loss 0.3353959619998932
Epoch 4 / 6, Step 100/ 469 , Loss 0.2456040233373642
Epoch 4 / 6, Step 200/ 469 , Loss 0.34422191977500916
Epoch 4 / 6, Step 300/ 469 , Loss 0.4524536728858948
Epoch 4 / 6, Step 400/ 469 , Loss 0.3071324825286865
Epoch 5 / 6, Step 100/ 469 , Loss 0.48587071895599365
Epoch 5 / 6, Step 200/ 469 , Loss 0.3435817360877991
Epoch 5 / 6, Step 300/ 469 , Loss 0.395

In [None]:
retrained_net_gmm_50_label_50_not = retrain_CNN_with_predicted_labels(trainset, labeled_train_data_50, unlabeled_train_data_50, unlabeled_train_loader_50, gmm)

Epoch 1 / 6, Step 100/ 469 , Loss 0.7618829607963562
Epoch 1 / 6, Step 200/ 469 , Loss 0.6318556666374207
Epoch 1 / 6, Step 300/ 469 , Loss 0.3435537815093994
Epoch 1 / 6, Step 400/ 469 , Loss 0.6657202839851379
Epoch 2 / 6, Step 100/ 469 , Loss 0.49679049849510193
Epoch 2 / 6, Step 200/ 469 , Loss 0.31751179695129395
Epoch 2 / 6, Step 300/ 469 , Loss 0.3622730076313019
Epoch 2 / 6, Step 400/ 469 , Loss 0.411258339881897
Epoch 3 / 6, Step 100/ 469 , Loss 0.33245041966438293
Epoch 3 / 6, Step 200/ 469 , Loss 0.29311832785606384
Epoch 3 / 6, Step 300/ 469 , Loss 0.28612399101257324
Epoch 3 / 6, Step 400/ 469 , Loss 0.36015889048576355
Epoch 4 / 6, Step 100/ 469 , Loss 0.24244891107082367
Epoch 4 / 6, Step 200/ 469 , Loss 0.4388328194618225
Epoch 4 / 6, Step 300/ 469 , Loss 0.31226906180381775
Epoch 4 / 6, Step 400/ 469 , Loss 0.32186946272850037
Epoch 5 / 6, Step 100/ 469 , Loss 0.2818697988986969
Epoch 5 / 6, Step 200/ 469 , Loss 0.3400105834007263
Epoch 5 / 6, Step 300/ 469 , Loss 0.30

In [None]:
retrained_net_gmm_30_label_70_not = retrain_CNN_with_predicted_labels(trainset, labeled_train_data_30, unlabeled_train_data_70, unlabeled_train_loader_70, gmm)

Epoch 1 / 6, Step 100/ 469 , Loss 0.595768392086029
Epoch 1 / 6, Step 200/ 469 , Loss 0.3956579566001892
Epoch 1 / 6, Step 300/ 469 , Loss 0.51374351978302
Epoch 1 / 6, Step 400/ 469 , Loss 0.42467403411865234
Epoch 2 / 6, Step 100/ 469 , Loss 0.4202234148979187
Epoch 2 / 6, Step 200/ 469 , Loss 0.4511590600013733
Epoch 2 / 6, Step 300/ 469 , Loss 0.4215220510959625
Epoch 2 / 6, Step 400/ 469 , Loss 0.2993345260620117
Epoch 3 / 6, Step 100/ 469 , Loss 0.3317875266075134
Epoch 3 / 6, Step 200/ 469 , Loss 0.32873404026031494
Epoch 3 / 6, Step 300/ 469 , Loss 0.4781407117843628
Epoch 3 / 6, Step 400/ 469 , Loss 0.3674331307411194
Epoch 4 / 6, Step 100/ 469 , Loss 0.3013722002506256
Epoch 4 / 6, Step 200/ 469 , Loss 0.2668156921863556
Epoch 4 / 6, Step 300/ 469 , Loss 0.36485666036605835
Epoch 4 / 6, Step 400/ 469 , Loss 0.3173113167285919
Epoch 5 / 6, Step 100/ 469 , Loss 0.32417914271354675
Epoch 5 / 6, Step 200/ 469 , Loss 0.3937910795211792
Epoch 5 / 6, Step 300/ 469 , Loss 0.459737688

#**Results**


###First we will look at the **baseline** results from the CNN trained on M labeled images

In [None]:
testFeatureEmbed, _ = getFeatureEmbedings(testloader)

In [None]:
baseline_accuracy, loss = test(testloader, net)
print(f"Our accuracy with just the CNN is: {baseline_accuracy} on test set from training on trainset" )

Our accuracy with just the CNN is: 0.808 on test set from training on trainset


### Next we will look at all the other methods that use weakly supervised learning; we aim to beat the baseline with each


In [None]:
accuracy_kmeans_70_label_30_not, loss = test(testloader, retrained_net_kmeans_70_label_30_not)
print(f"Our accuracy with the CNN and Mini-Batch Kmeans clustering with 70% labeled, 30% unlabeled is: {accuracy_kmeans_70_label_30_not} on test set" )

Our accuracy with the CNN and Mini-Batch Kmeans clustering with 70% labeled, 30% unlabeled is: 0.8542 on test set


In [None]:
accuracy_kmeans_50_label_50_not, loss = test(testloader, retrained_net_kmeans_50_label_50_not)
print(f"Our accuracy with the CNN and Mini-Batch Kmeans clustering with 50% labeled, 50% unlabeled is: {accuracy_kmeans_50_label_50_not} on test set" )

Our accuracy with the CNN and Mini-Batch Kmeans clustering with 50% labeled, 50% unlabeled is: 0.2156 on test set


In [None]:
accuracy_kmeans_30_label_70_not, loss = test(testloader, retrained_net_kmeans_30_label_70_not)
print(f"Our accuracy with the CNN and Mini-Batch Kmeans clustering with 30% labeled, 70% unlabeled is: {accuracy_kmeans_30_label_70_not} on test set" )

Our accuracy with the CNN and Mini-Batch Kmeans clustering with 30% labeled, 70% unlabeled is: 0.1 on test set


In [None]:
accuracy_gmm_70_label_30_not, loss = test(testloader, retrained_net_gmm_70_label_30_not)
print(f"Our accuracy with the CNN and Kmeans clustering with 70% labeled, 30% unlabeled is: {accuracy_gmm_70_label_30_not} on test set from training on trainset" )

Our accuracy with the CNN and Kmeans clustering with 70% labeled, 30% unlabeled is: 0.1 on test set from training on trainset


In [None]:
accuracy_gmm_50_label_50_not, loss = test(testloader, retrained_net_gmm_50_label_50_not)
print(f"Our accuracy with the CNN and Kmeans clustering with 50% labeled, 50% unlabeled is: {accuracy_gmm_50_label_50_not} on test set from training on trainset" )

Our accuracy with the CNN and Kmeans clustering with 50% labeled, 50% unlabeled is: 0.1 on test set from training on trainset


In [None]:
accuracy_gmm_30_label_70_not, loss = test(testloader, retrained_net_gmm_30_label_70_not)
print(f"Our accuracy with the CNN and Kmeans clustering with 30% labeled, 70% unlabeled is: {accuracy_gmm_30_label_70_not} on test set from training on trainset" )

Our accuracy with the CNN and Kmeans clustering with 30% labeled, 70% unlabeled is: 0.1001 on test set from training on trainset


#**Conclusions**

One positive result from our experiments is that Mini-Batch K-means with 70% labeled and 30% unlabeled data raised test accuracy from 80.8% to 85.4%, roughly a 5 percentage point increase over the baseline. However, every other ratio and clustering method ended at or near chance level: 21.6% for Mini-Batch K-means at 50/50 and about 10% for the rest, which is what random guessing achieves on ten classes. Mini-Batch K-means was in general more accurate than plain K-means.
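
To make the comparison with the baseline concrete, the arithmetic can be spelled out from the two accuracies printed in the Results section. This is only a worked calculation; the variable names are illustrative, and the chance level assumes a balanced ten-class test set.

```python
baseline = 0.808         # test accuracy of the CNN trained on labeled images only
best_retrained = 0.8542  # Mini-Batch K-means, 70% labeled / 30% unlabeled
num_classes = 10

absolute_gain = best_retrained - baseline  # difference in accuracy
relative_gain = absolute_gain / baseline   # gain relative to the baseline
chance = 1 / num_classes                   # uniform guessing on balanced classes

print(f"{absolute_gain * 100:.1f} percentage points")  # 4.6 percentage points
print(f"{relative_gain * 100:.1f}% relative")           # 5.7% relative
print(f"chance level: {chance:.2f}")                    # chance level: 0.10
```

The gain is therefore about 4.6 percentage points in absolute terms, or about 5.7% relative to the baseline accuracy.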


It would have been a more useful result if the method had also helped when the unlabeled share was equal to or greater than the labeled share, but an improvement to 85.4% accuracy with 30% of the data unlabeled is not a total loss.


We were limited in which clustering methods we could use by RAM capacity and training times. We originally used GMM, but it took 2 hours to train with a GPU and would often crash. We also attempted DBSCAN and OPTICS models, which faced similar issues.


One reason we believe the results weren't as strong as we expected is that we randomly split the data into M labeled and N-M unlabeled images without taking the labels of the data into consideration. This could have led to some label types being concentrated in the unlabeled set, leading to poor classification results.
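
One way to rule out this explanation would be a stratified split, which takes the same fraction of every class. The sketch below is a hypothetical alternative in plain Python, not what the notebook did; `stratified_split` and its `seed` parameter are illustrative names.

```python
import random
from collections import defaultdict

def stratified_split(labels, labeled_fraction, seed=0):
    # Group sample indices by class
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    labeled, unlabeled = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction from every class
        cut = int(round(labeled_fraction * len(indices)))
        labeled.extend(indices[:cut])
        unlabeled.extend(indices[cut:])
    return labeled, unlabeled

# Toy labels: 3 classes with 10 samples each
toy_labels = [c for c in range(3) for _ in range(10)]
labeled_idx, unlabeled_idx = stratified_split(toy_labels, 0.7)
print(len(labeled_idx), len(unlabeled_idx))  # 21 9
```

The two index lists could then be wrapped in `torch.utils.data.Subset` objects in place of the output of `random_split`.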


Another possible explanation is that we performed clustering on the feature embeddings of the images. After training our original CNN on the M labeled images, we used it to get a vector of size 4096 for each image in the training set, and we clustered these embedded versions of the images. A 4096-dimensional feature might have been too small to encode all the information about an image that is required to cluster it properly (although in prior tests we reached 86% accuracy from clustering alone when using a fully labeled dataset of size N).

Lastly, we believe that the way we assigned labels to the unlabeled images could be improved. We used the mode label in each cluster (taken from the labeled images in that cluster). In the future we should only assign a label when its probability exceeds a certain threshold, since with 30 clusters noisy or inaccurate labels could have been introduced.
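
One possible reading of this threshold idea is to trust a cluster only when its majority label is sufficiently dominant. The sketch below shows that filter on toy data; it was not run as part of our experiments, and both `confident_cluster_labels` and the 0.8 threshold are assumptions chosen for illustration.

```python
from collections import Counter

def confident_cluster_labels(cluster_ids, labels, threshold=0.8):
    # Count the known labels inside each cluster
    votes = {}
    for cluster, label in zip(cluster_ids, labels):
        votes.setdefault(cluster, Counter())[label] += 1

    # Keep a cluster only if its top label holds at least `threshold` of the votes
    confident = {}
    for cluster, counts in votes.items():
        label, count = counts.most_common(1)[0]
        purity = count / sum(counts.values())
        if purity >= threshold:
            confident[cluster] = label
    return confident

# Toy data: cluster 0 is pure, cluster 1 is split evenly between two classes
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1]
labels = [2, 2, 2, 2, 5, 5, 9, 9]
print(confident_cluster_labels(cluster_ids, labels))  # {0: 2}
```

Unlabeled images that fall into a rejected cluster would be left out of the retraining set instead of receiving a noisy label.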