<a href="https://colab.research.google.com/github/HardikPaliwal/CS484Proj/blob/master/cs484%20proj.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **CS484 Final Project**

#### Topic 6: Weakly supervised classification

#### Hardik Paliwal (20725413), Lance Pereira (20719626)

______________________________________________________________

#**Table Of Contents**
- A) Abstract
- B) High Level Goals and Methedology
- C) Team Members and Contributions
- D) External Code Libraries
- E) Code
  - 1) Setup (imports, loading data)
  - 2) Our CNN (Based on VGG11)
  - 3) Clustering Methods from SKlearn 
- F) Experiments
  - 1) Diffirent ratios of M/N
  - 2) Diffirent Clustering Methods
  - 3) Clustering as Final vs Retrained CNN
- G) Results
  - 1) Results of diffirent ratios
  - 2) Results from diffirent clustering methods
  - 3) Results from Clustering vs retrained CNN
- H) Conclusions

#**A) Abstract:**

For our project we have decided to do Project 6, choosing specifically Fashion MNIST. We use weakly supervised classification to try and improve results compared to just using supervised learning.
 

#**B) High Level Goals and Methedology:**

Out high level goal is to use weakly supervised classification to improve our 
prediction ability compared to just training using labeled images. We hope to 
achieve atleast a 5% increase in our test prediction score using clustering 
methods.

Our method is split into two experiments, 
- we first will test to see training
a CNN on N-M labeled images, then use a cluestering method to classify all the 
N images, using the majority label in each cluster as the predicted label
- secondly we will try using the predicted labels from the previous step to 
train a new CNN model, to see if it performs better

Our baseline will be a simple CNN of N+1 (with the K extra classes being labelled as "unknown") to differentiate between unlabeled and labeled classes. Then simply run (some unsupervised model) on the unlabeled data. 

#**C) Team Members and Contributions**

- Hardik Paliwal
  - Created CNN based on VGG11 for training
  - Created function to gather features from pretrained CNN (trained on M labeled images)
  - Created method to get predicted labels from clustering methods

- Lance Pereira
  - (TODO) Modified VGG11 like CNN for better results
  - Created experiments for ratios of M/N 
  - Split data into labeled and unlabeled
  - Created experiments for diffirent clustering methods
    - Kmeans, Kmedians, Kmodes
    - GMM
  - Created experiments for final classification using Kmeans vs using predicted labels to retrain CNN

#**D) External Code Libraries**

We used

- Pytorch
  - Because it was crucial for the quick training of our models
  - Allowed us to not have to deal with calculating back propogation
- Sklearn
  - Provided us a large range of clustering methods for quick experimentation
- Numpy
  - Useful for large matrix operations

In [68]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision as tv
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader
from torch.autograd import Variable
import sklearn
from torch import optim
import numpy as np
import itertools

%matplotlib inline

# **E) Code**

### **E) Code: Setup**

The below cells:
- Import Fshion MNIST
- define functions that strip label from data

In [89]:
# Constants
dev=torch.device("cuda") 
NUM_EPOCHS = 6
NUM_CLUSTERS = 30

In [90]:
trainset = tv.datasets.FashionMNIST(root="./", download=True,train=True,  transform=tv.transforms.Compose(
    [tv.transforms.Resize(32), tv.transforms.ToTensor()]))
# trainloader = DataLoader(trainset, batch_size=128, shuffle=True)

testset = tv.datasets.FashionMNIST(root="./", download=True,train=False,  transform=tv.transforms.Compose(
    [tv.transforms.Resize(32), tv.transforms.ToTensor()]))
testloader = DataLoader(testset, batch_size=128, shuffle=True)

### **E) Code: Define our Base CNN (Based on VGG11), Test, Train methods**

In [None]:
#This is an implementation of VGG11 (which is a precursor to VGG16) for mnist dataset.
# it also takes in n, which is the number of classes. N+1 class stands for unknown. 

#this will let us differeniate the unlabelled data from the labelled data
class BasicNet(nn.Module):
    def __init__(self, n=9):
        super(BasicNet, self).__init__()
        self.batchNorm = [nn.BatchNorm2d(64), nn.BatchNorm2d(128),nn.BatchNorm2d(256), nn.BatchNorm2d(256),
                          nn.BatchNorm2d(512), nn.BatchNorm2d(512), nn.BatchNorm2d(512), nn.BatchNorm2d(512)]
        self.conv = [
        nn.Conv2d(1, 64, 3, 1, 1) ,nn.Conv2d(64, 128, 3, 1, 1), nn.Conv2d(128, 256, 3, 1, 1), nn.Conv2d(256, 256, 3, 1, 1)
       ,nn.Conv2d(256, 512, 3, 1, 1), nn.Conv2d(512, 512, 3, 1, 1), nn.Conv2d(512, 512, 3, 1, 1), nn.Conv2d(512, 512, 3, 1, 1)
        ]
        maxPool = nn.MaxPool2d(2, stride=2)
        self.conv1 = nn.Sequential(self.conv[0], self.batchNorm[0], nn.ReLU(), maxPool)
        self.conv2 = nn.Sequential(self.conv[1], self.batchNorm[1], nn.ReLU(), maxPool)
        self.conv3 = nn.Sequential(self.conv[2], self.batchNorm[2], nn.ReLU())
        self.conv4 = nn.Sequential(self.conv[3], self.batchNorm[3], nn.ReLU(), maxPool) 
        self.conv5 = nn.Sequential(self.conv[4], self.batchNorm[4], nn.ReLU())
        self.conv6 = nn.Sequential(self.conv[5], self.batchNorm[5], nn.ReLU(), maxPool)
        self.conv7 = nn.Sequential(self.conv[6], self.batchNorm[6], nn.ReLU()) 
        self.conv8 = nn.Sequential(self.conv[7], self.batchNorm[7], nn.ReLU(), maxPool)
        self.fc1 = nn.Linear(512, 4096)
        self.fc2 = nn.Linear(4096, 4096)
        self.fc3 = nn.Linear(4096, n+1)
        
    def forward(self, x, feature_embedding=False):
        dropOut = nn.Dropout(p=0.5)

        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        x = self.conv5(x)
        x = self.conv6(x)
        x = self.conv7(x)
        x = self.conv8(x)
        x = torch.flatten(x, 1)
        
        x = dropOut(F.relu(self.fc1(x)))
        x = dropOut(F.relu(self.fc2(x)))
        if(feature_embedding):
          return x
        #Not sure why this works without softmax. probably a reasoning givin in the paper (cause the output can range from anything (not normalized to a 0-1 probability range))
        x = self.fc3(x)
        return x

In [None]:
def test(data, net):
    net.eval()
    loss_func = nn.CrossEntropyLoss()

    total_correct = 0
    total_loss = 0
    with torch.no_grad():
        correct = 0
        total = 0
        for i, (images, labels) in enumerate(data):
            images= images.to(dev)
            labels = labels.to(dev)
            test_pred = net(images)

            pred = torch.max(test_pred, 1)[1].data.squeeze()
            total_correct+= (pred == labels).sum().item()
            loss = loss_func( test_pred, labels)
            total_loss+= loss.item()*images.size(0)
        # return total_correct/len(data.dataset), total_loss/len(data.dataset)
        # Note From Lance: We shouldn't divide the total loss
        return total_correct/len(data.dataset), total_loss
  
def train(num_epochs, net, trainloader):
    optimizer = optim.SGD(net.parameters(), lr=0.01)
    loss_func = nn.CrossEntropyLoss()

    accuracy_through_epochs = []
    total_step = len(trainloader)
    
    for epoch in range(num_epochs):
        net.train()
        for i, (images, labels) in enumerate(trainloader):
            images= images.to(dev)
            labels = labels.to(dev)
            optimizer.zero_grad()           
            prediction = net(images)
            loss = loss_func( prediction, labels)
            loss.backward()
            optimizer.step()
            if ((i +1) % 100 == 0):
                print(f"Epoch {epoch+1} / {num_epochs}, Step {i+1}/ {total_step} , Loss {loss.item()}")

    return accuracy_through_epochs, net

### **E) Code: Split Fashion MNIST training data into M labeled images, and N-M unlabeled**

In [98]:
#Modifies dataset in place to only have values correspounding to the labels in classesToUse
def splitTrainingData(training_data, M_percent):
  # N is len(training_data)
  len_N = len(trainset)

  # M is the number of labeled images we want
  len_M = int(M_percent*len_N)
  
  labeled_data, unlabeled_data = torch.utils.data.random_split(trainset, [len_M, len_N - len_M])

  # strip the labels from unlabeled_data
  labeled_data_loader = DataLoader(labeled_data, batch_size=128, shuffle=True)
  unlabeled_data_loader = DataLoader(unlabeled_data, batch_size=128, shuffle=True)

  return labeled_data_loader, unlabeled_data_loader, labeled_data, unlabeled_data


In [111]:
labeled_train_loader, unlabeled_train_loader, labeled_train_data, unlabeled_train_data = splitTrainingData(trainset, 0.7)

### **E) Code: Train Base CNN on M labeled training images**

In [83]:
net = BasicNet()
net.to(dev)
result, trained_net = train(NUM_EPOCHS, net, labeled_train_loader)

Epoch 1 / 6, Step 100/ 329 , Loss 0.511478066444397
Epoch 1 / 6, Step 200/ 329 , Loss 0.4268065094947815
Epoch 1 / 6, Step 300/ 329 , Loss 0.3081866502761841
Epoch 2 / 6, Step 100/ 329 , Loss 0.3177548348903656
Epoch 2 / 6, Step 200/ 329 , Loss 0.25087860226631165
Epoch 2 / 6, Step 300/ 329 , Loss 0.3186558187007904
Epoch 3 / 6, Step 100/ 329 , Loss 0.20357073843479156
Epoch 3 / 6, Step 200/ 329 , Loss 0.2981604039669037
Epoch 3 / 6, Step 300/ 329 , Loss 0.2449653446674347
Epoch 4 / 6, Step 100/ 329 , Loss 0.194676473736763
Epoch 4 / 6, Step 200/ 329 , Loss 0.23580771684646606
Epoch 4 / 6, Step 300/ 329 , Loss 0.3226121664047241
Epoch 5 / 6, Step 100/ 329 , Loss 0.14157429337501526
Epoch 5 / 6, Step 200/ 329 , Loss 0.12187501788139343
Epoch 5 / 6, Step 300/ 329 , Loss 0.1241547167301178
Epoch 6 / 6, Step 100/ 329 , Loss 0.0540289506316185
Epoch 6 / 6, Step 200/ 329 , Loss 0.1819475293159485
Epoch 6 / 6, Step 300/ 329 , Loss 0.14424601197242737


### **E) Code: Use the 'semi' trained CNN to gain important features (feature embedings) of N-M *unlabeled* training images**

In [84]:
#do this to store the results in 1 numpy array of 10000 images vs like 20 batches of size 128 images. 
#get memory error when doing it on a batch of size len(trainloader), so we have to combine the results for trainloader

def getFeatureEmbedings(dataloader):
  featureEmbed = []
  predictedDigit = []

  with torch.no_grad():
    for i, (images, labels) in enumerate(dataloader):
        images= images.to(dev)
        labels = labels.to(dev)
        featureEmbed.append(net(images, feature_embedding=True).to("cpu").numpy())
        pred = net(images)
        predictedDigit.append(torch.max(pred, 1)[1].data.squeeze().to("cpu").numpy())

  # flatten lists
  featureEmbed = np.array(list(itertools.chain(*featureEmbed)))
  predictedDigit = np.array(list(itertools.chain(*predictedDigit)))

  return featureEmbed, predictedDigit

In [105]:
# We need the data with the labels
trainFeatureEmbed, trainPredictedDigit = getFeatureEmbedings(labeled_train_loader)

### **E) Code: Train Cluster methods on train N-M *unlabeled* images**

In [106]:
from sklearn.cluster import MiniBatchKMeans

kmeans = MiniBatchKMeans(n_clusters = NUM_CLUSTERS)
kmeans.fit(trainFeatureEmbed)

MiniBatchKMeans(n_clusters=30)

In [107]:
#In order to see how well k-means did we can use this supervised method of defining what a cluster is by seting the cluster label as the most common digits in that cluster
#unsupervised approaches include: manually selecting class depending on mean image
def retrieve_cluster_to_classification(cluster_labels,y_train):
  reference_labels = {}
# For loop to run through each label of cluster label
  for i in range(len(np.unique(kmeans.labels_))):
    index = np.where(cluster_labels == i,1,0)
    num = np.bincount(y_train[index==1]).argmax()
    reference_labels[i] = num
    # TODO: Right now the refrence label just maps to the majority label, should we also consider the 2nd and 3rd highest
  return reference_labels

### **E) Code: Define validation functions a) cluster final, b) retrained CNN**



In [132]:
def cluster_final_classifier_validation(my_labeled_trainset, my_unlabeled_trainset, unlabeled_train_loader, cluster_method):
  # mapping from NUM_CLUSTERS to FASHION_MNIST classes
  reference_labels = retrieve_cluster_to_classification(
      cluster_method.labels_, 
      np.array(my_labeled_trainset.dataset.targets[my_labeled_trainset.indices])
  )
  
  unlabelTrainFeatureEmbed, _ = getFeatureEmbedings(unlabeled_train_loader)

  # Assign new labels
  predicted_test = cluster_method.predict(unlabelTrainFeatureEmbed)
  for i in range(unlabelTrainFeatureEmbed.shape[0]):
    my_labeled_trainset.dataset.targets[my_labeled_trainset.indices[i]] = reference_labels[predicted_test[i]]

  new_combined_data = torch.utils.data.ConcatDataset([my_labeled_trainset, my_unlabeled_trainset])
  new_combined_data_loader = DataLoader(new_combined_data, batch_size=128, shuffle=True)
  retrained_net = BasicNet()
  retrained_net.to(dev)
  result, retrained_net = train(NUM_EPOCHS, retrained_net, new_combined_data_loader)
  return retrained_net



#**Experiments**


#**Results**


###First we will look at the baseline results from the CNN trained on M labeled images

In [108]:
testFeatureEmbed, _ = getFeatureEmbedings(testloader)

In [134]:
baseline_accuracy, loss = test(testloader, net)
print(f"Our accuracy with just the CNN is: {baseline_accuracy} on test set from training on trainset" )

Our accuracy with just the CNN is: 0.8698 on test set from training on trainset


### Next we will look at all the other methods


In [133]:
retrained_net = cluster_final_classifier_validation(labeled_train_data, unlabeled_train_data, unlabeled_train_loader, kmeans)
# print(f"Our accuracy with kmeans is: {accuracy} on test set from training on trainset" )

Epoch 1 / 6, Step 100/ 469 , Loss 1.1195118427276611
Epoch 1 / 6, Step 200/ 469 , Loss 0.9911301136016846
Epoch 1 / 6, Step 300/ 469 , Loss 0.8660149574279785
Epoch 1 / 6, Step 400/ 469 , Loss 0.7618424892425537
Epoch 2 / 6, Step 100/ 469 , Loss 0.8448086380958557
Epoch 2 / 6, Step 200/ 469 , Loss 0.854750394821167
Epoch 2 / 6, Step 300/ 469 , Loss 0.7505749464035034
Epoch 2 / 6, Step 400/ 469 , Loss 0.7192577123641968
Epoch 3 / 6, Step 100/ 469 , Loss 0.8394836187362671
Epoch 3 / 6, Step 200/ 469 , Loss 0.7148918509483337
Epoch 3 / 6, Step 300/ 469 , Loss 0.681002676486969
Epoch 3 / 6, Step 400/ 469 , Loss 0.7149963974952698
Epoch 4 / 6, Step 100/ 469 , Loss 0.701129674911499
Epoch 4 / 6, Step 200/ 469 , Loss 0.7081258296966553
Epoch 4 / 6, Step 300/ 469 , Loss 0.6106261014938354
Epoch 4 / 6, Step 400/ 469 , Loss 0.7273215055465698
Epoch 5 / 6, Step 100/ 469 , Loss 0.6280008554458618
Epoch 5 / 6, Step 200/ 469 , Loss 0.6560176610946655
Epoch 5 / 6, Step 300/ 469 , Loss 0.6375848054885

In [135]:
accuracy_70_label_30_not, loss = test(testloader, retrained_net)
print(f"Our accuracy with just the CNN is: {accuracy_70_label_30_not} on test set from training on trainset" )

Our accuracy with just the CNN is: 0.7596 on test set from training on trainset


In [73]:
#BUT Note our 30 classes for k-means and the use of the supervised cluster classification. We only get around 0.5 with 10 classes.
#also note that I get varying results, from 0.9 to 0.99 when running on 30 clusters.
target_test = testset.targets.numpy()
accuracy = np.sum(np.where(number_labels == target_test, 1, 0)) / target_test.shape[0]
print(f"Our accuracy with kmeans is: {accuracy} on test set from training on trainset" )

#**Conclusions**

- We found that the most useful ratio of M/N which was 30% labeled, 70% unlabeled yieled
- This compared to 50:50
- Compared to 70% labeled, 30% unlabeled