<a href="https://colab.research.google.com/github/ImanLiao/COMP3029-ComputerVision/blob/main/COMP3029_Lab06.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
This notebook was prepared by Dr. Iman Yi Liao. The main objective of this lab session is to learn the basic functions of the `torch.autograd` package and to use them to build solutions to optimisation problems.

Main resources:
- https://pytorch.org/docs/stable/autograd.html

# Mount Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


# 3D Face Dataset
I had a 3D face dataset converted and stored in MATLAB .mat files. You can access the main folder [here](https://drive.google.com/drive/folders/1LRkrS3nnbLeX2dXgQrn_kzBP7Qz6DC6P?usp=sharing), data files [here](https://drive.google.com/drive/folders/13gi7l93wvBNvWeAfgWDbBHyoOUEZ3Yfh?usp=sharing), and labels [here](https://drive.google.com/drive/folders/1Fc-e2uH1M7G-Xw6zdH56dlDY1H3C4nDM?usp=sharing). Each 3D face has the same number of vertices, and the vertices are in correspondence across faces: the vertex at a given index in one face corresponds to the vertex at the same index in every other face. There are several tasks one could do with the data:
1. To build a model to predict the sex of a 3D face if the relevant labels are available.
2. To build a model to cluster the 3D face dataset into two categories (hypothetically male and female) when no labels are available.
3. To build a model to cluster a 3D face into different regions (hypothetically each region may correspond to an anatomical structure of the face).

You can probably think of more, but we will only demonstrate task 1 here.

In [1]:
%cd /content/gdrive/My Drive/USF 3D Face Database

/content/gdrive/My Drive/USF 3D Face Database


# Define the USF3DFaceDataset

In [23]:
import torch
import numpy as np
import scipy.io as spio
from torch.utils.data import Dataset
from torch.utils.data import DataLoader, random_split
# from torch.autograd import Variable
import os
import torch.optim as optim

In [24]:
# define custom USF 3D face dataset
class USF3DFaceDataset(Dataset):
    def __init__(self, data_root):
        self.data_root = data_root
        self.samples = []
        self._init_dataset()

    def __len__(self):
        return len(self.samples)

    def __getitem__(self,index):
        return self.samples[index]

    def _init_dataset(self):
        genderfolder = os.path.join(self.data_root,'labels_gender/')
        for gender in os.listdir(genderfolder):
            labels = spio.loadmat(genderfolder + gender)['targets']
            labels = torch.from_numpy(labels).float()

        datafolder = os.path.join(self.data_root, 'data/')
        for i in range(len(os.listdir(datafolder))):
            data = spio.loadmat(datafolder + 'faceobject' + ('%s' % (i+1)))['alignedFace']
            data = torch.from_numpy(data).float()
            self.samples.append((data, labels[i]))


In [25]:
# Load the dataset
directory = '/content/gdrive/My Drive/USF 3D Face Database/'

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
                            
facedataset = USF3DFaceDataset(directory)
print(type(facedataset))
print(np.shape(facedataset[0][0]), type(facedataset[0][0]))
print('USF3D Face Dataset is loaded: %d samples' % len(facedataset))


<class '__main__.USF3DFaceDataset'>
torch.Size([75972, 3]) <class 'torch.Tensor'>
USF3D Face Dataset is loaded: 100 samples

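The way a custom `Dataset` interacts with `random_split` and `DataLoader` can be tried out without the real data. The following is a minimal sketch, assuming a hypothetical `ToyFaceDataset` filled with random tensors (50 vertices per face and a label of shape `[1]` are illustrative choices, not properties of the USF data). It shows that each batch is a `(faces, labels)` pair whose tensors carry the batch dimension first.

```python
import torch
from torch.utils.data import Dataset, DataLoader, random_split


class ToyFaceDataset(Dataset):
    """Hypothetical stand-in for USF3DFaceDataset, filled with random data."""

    def __init__(self, n_samples=10, n_vertices=50):
        # each sample is a (vertices, label) pair, like self.samples above
        self.samples = [
            (torch.randn(n_vertices, 3), torch.tensor([float(i % 2)]))
            for i in range(n_samples)
        ]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        return self.samples[index]


toy = ToyFaceDataset()
train_part, val_part = random_split(toy, [8, 2])

loader = DataLoader(train_part, batch_size=4, shuffle=True)
batch = next(iter(loader))
faces, labels = batch

print(len(batch))                 # 2: one entry for faces, one for labels
print(faces.shape, labels.shape)  # torch.Size([4, 50, 3]) torch.Size([4, 1])
```

Note that `len(batch)` is always 2 here, whatever the batch size; the number of samples in the batch is `len(faces)`.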

# Define my own model for sex prediction

In [26]:
# Define my own model
class modelAutoGroupingWithGrad:
    # model: tr{((GF)W)B}-y = 0, where F is the original data, G the grouping matrix, W the feature extractor, and B the GroupLasso coefficients, y the label
    
    def __init__(self,datasize,nGroups=10,dimFeatures=8):
        self.datasize = datasize # size = [N,D] where N is the number of points and D the dimension of each point, e.g. N=1000, D=3
        self.G = torch.rand(nGroups,self.datasize[0],requires_grad=True) # G is the grouping matrix, mapping N points to K=nGroups groups; intended as a zero-one matrix, but initialised here with uniform values in [0,1) and left unconstrained
        self.W = torch.randn(self.datasize[1],dimFeatures,requires_grad=True) # W is the feature extractor
        self.B = torch.randn(dimFeatures,nGroups,requires_grad=True) # B is the GroupLasso coefficients
        
    def forward(self,x):
        # pass the input tensor through each of the operations
        x = torch.matmul(self.G, x)
        x = torch.matmul(x, self.W)
        x = torch.matmul(x, self.B)
        x = torch.trace(x)
        return x

    def loss(self,x,y):
        # calculate the loss
        # using square loss here
        return (self.forward(x) - y)**2 

    def batch_loss(self, batchdata, para_IRLS):
        # define the average loss of a batch
        faces, labels = batchdata
        loss = 0
        for i in range(len(faces)):
            # loss += (self.forward(faces[i]) - labels[i])**2
            loss += self.loss(faces[i], labels[i])

        temp = torch.matmul(self.B, para_IRLS.sqrt())
        temp = torch.matmul(temp.T, temp)
        loss_L1 = torch.trace(temp) + torch.sum(1/torch.diag(para_IRLS))
        
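        # average over the number of samples in the batch; len(batchdata)
        # would always be 2 because batchdata is a (faces, labels) pair
        return loss/len(faces) + loss_L1
        # return loss/len(faces)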
    

    def train_with_grad(self, trainset, valset, para_IRLS, total_epoch, learning_rate):
        # training the model with the given trainset
        train_loader = DataLoader(trainset, batch_size=5, shuffle=True, num_workers=2)
        val_loader = DataLoader(valset, batch_size=5, shuffle=True, num_workers=2)

        # construct model parameters as iterable
        model_parameters = iter([self.G, self.W, self.B])
        #optimizer = optim.SGD(model_parameters, learning_rate, momentum=0.9)
        optimizer = optim.Adam(model_parameters, learning_rate)

        print(self.G)
        
        for epoch in range(total_epoch):
            running_loss = 0
            for i, batchdata in enumerate(train_loader,0):
                # zero the parameter gradients
                optimizer.zero_grad()
                loss = self.batch_loss(batchdata, para_IRLS)
                loss.backward()
##                self.W.data.sub_(self.W.grad * learning_rate)
##                self.B.data.sub_(self.B.grad * learning_rate)
##                self.G.data.sub_(self.G.grad * learning_rate)
                optimizer.step()

##                print('parameter G at batch %d in epoch %d :' % (i, epoch))
##                print(self.G)
                
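                running_loss += loss.item()  # accumulate a Python float so each batch's graph can be freed
            running_loss = running_loss / (i + 1)  # i is the last batch index, so there are i+1 batches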

            # validating the model within the current epoch
            with torch.no_grad():
                val_loss = 0
                for i, batchdata in enumerate(val_loader,0):
                    val_loss += self.batch_loss(batchdata, para_IRLS)
                val_loss = val_loss / (i + 1)  # i is the last batch index, so there are i+1 batches

            # Here we should plot the training loss and validation loss -- to be done!!!
            print('[Epoch %d] Training loss: %.3f, Validation loss: %.3f' % (epoch + 1,  running_loss, val_loss ))
            

    def train_IRLS(self, trainset, valset, max_IRLS_iter=50, total_epoch=50, learning_rate=0.0001):
        ### to initialise and to define, para_IRLS is the parameter introduced by the iteratively reweighted least square method
        
        para_IRLS = torch.sum(torch.mul(self.B, self.B), dim=0) + 0.1e-5
        # para_IRLS = torch.diag(torch.matmul(self.B.t(), self.B)) + 0.1e-5
        para_IRLS = para_IRLS.detach()
        para_IRLS = 1 / para_IRLS.sqrt()
        para_IRLS = torch.diag(para_IRLS)

        # IRLS algorithm
        stop = False
        n_iter = 0
        while not stop:
            # optimize w.r.t. G,W,B while keeping para_IRLS fixed
            print('IRLS iteration %d begins...' % (n_iter+1))
            self.train_with_grad(trainset, valset, para_IRLS, total_epoch, learning_rate)

            temp = torch.sum(torch.mul(self.B, self.B), dim=0) + 0.1e-5
            temp = temp.detach()
            temp = 1 / temp.sqrt()
            diff = torch.norm(torch.diag(para_IRLS) - temp)
            print('Difference between the values of IRLS parameters in two successive iterations: %.6f' % diff)
            para_IRLS = torch.diag(temp)
            n_iter += 1
            stop = (diff < 1e-6) or (n_iter >= max_IRLS_iter)
            

    # def test(self, testset):


    def predict(self, inputs):
        predict_labels = []
        for i in range(len(inputs)):
            predict_labels.append(self.forward(inputs[i]))

        return predict_labels

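The `loss_L1` term in `batch_loss` and the update of `para_IRLS` in `train_IRLS` fit together as follows. Writing $b_j$ for column $j$ of `B` and $p_j$ for the $j$-th diagonal entry of `para_IRLS`, the term equals $\sum_j \left(p_j \lVert b_j \rVert_2^2 + 1/p_j\right)$. For fixed `B`, each summand is at least $2\lVert b_j \rVert_2$, with equality at $p_j = 1/\lVert b_j \rVert_2$, which is the update used in `train_IRLS`. The sketch below checks this numerically on a random `B`; it leaves out the small constant that the notebook adds for numerical stability, so it illustrates the identity and is not a copy of the training code.

```python
import torch

torch.manual_seed(0)

B = torch.randn(8, 10)           # dimFeatures x nGroups, same layout as self.B
col_norms = B.norm(dim=0)        # ||b_j||_2 for every group (column) j
P = torch.diag(1.0 / col_norms)  # IRLS weights at their closed-form optimum

# same expression as loss_L1 in batch_loss
temp = torch.matmul(B, P.sqrt())
surrogate = torch.trace(torch.matmul(temp.T, temp)) + torch.sum(1 / torch.diag(P))

# group-lasso penalty: sum of the column norms of B
group_lasso = col_norms.sum()

print(torch.allclose(surrogate, 2 * group_lasso, rtol=1e-4))  # True
```

So with `para_IRLS` held at its optimum, minimising `loss_L1` over `B` penalises twice the sum of column norms of `B`, which is a group-lasso penalty that encourages whole groups (columns) to shrink towards zero.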

# Train the model with facedataset

In [27]:
trainset, valset = random_split(facedataset, [80, 20])

# Create the auto grouping model
size = np.shape(trainset[0][0])
model = modelAutoGroupingWithGrad(size)

# Train the model (and validate it while training)
EPOCH = 50
LR = 0.0001
MAX_IRLS_ITERATION = 10
model.train_IRLS(trainset, valset, max_IRLS_iter=MAX_IRLS_ITERATION, total_epoch=EPOCH, learning_rate=LR)

IRLS iteration 1 begins...
tensor([[0.0671, 0.5402, 0.2105,  ..., 0.1065, 0.9351, 0.2561],
        [0.2123, 0.2237, 0.0709,  ..., 0.0474, 0.1605, 0.1908],
        [0.8554, 0.2697, 0.2406,  ..., 0.4311, 0.3233, 0.0775],
        ...,
        [0.2030, 0.4629, 0.3604,  ..., 0.8924, 0.0200, 0.3272],
        [0.2567, 0.5851, 0.1198,  ..., 0.0579, 0.3515, 0.6080],
        [0.4711, 0.9480, 0.7919,  ..., 0.6823, 0.3416, 0.7124]],
       requires_grad=True)
[Epoch 1] Training loss: 136432288.000, Validation loss: 54131468.000
[Epoch 2] Training loss: 24762522.000, Validation loss: 50061108.000
[Epoch 3] Training loss: 21466490.000, Validation loss: 37960708.000
[Epoch 4] Training loss: 23203540.000, Validation loss: 38223228.000
[Epoch 5] Training loss: 20416908.000, Validation loss: 56042880.000
[Epoch 6] Training loss: 29289174.000, Validation loss: 49611660.000
[Epoch 7] Training loss: 17410808.000, Validation loss: 35204508.000
[Epoch 8] Training loss: 16110673.000, Validation loss: 32987754

# Using GPU device
The example above runs entirely on the CPU: `device` is defined when loading the dataset but is never used. [Here's an example](https://pytorch.org/tutorials/beginner/examples_autograd/two_layer_net_autograd.html) of how you may modify the code so that it makes use of GPUs when they are available.
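
As a minimal sketch of the usual pattern (the shapes are hypothetical and much smaller than the real faces, and the single random `face` and `label` are stand-ins for a data sample): create the parameters directly on the chosen device, and move each input to the same device before the forward pass.

```python
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Create the parameters on the device directly so that they remain leaf tensors.
# Calling .to(device) on a tensor created with requires_grad=True on the CPU
# returns a non-leaf copy when the device differs, and its .grad is not populated.
G = torch.rand(10, 1000, device=device, requires_grad=True)
W = torch.randn(3, 8, device=device, requires_grad=True)
B = torch.randn(8, 10, device=device, requires_grad=True)

# Inputs and labels must live on the same device as the parameters
face = torch.randn(1000, 3).to(device)  # stand-in for one face
label = torch.tensor(1.0).to(device)    # stand-in for its label

loss = (torch.trace(G @ face @ W @ B) - label) ** 2
loss.backward()

print(G.is_leaf, G.grad.device, W.grad.shape)
```

In the model class above, the same idea would mean passing `device=device` when creating `self.G`, `self.W` and `self.B`, and calling `.to(device)` on `faces` and `labels` inside the training loop.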

# Using tensor.detach() and tensor.clone()
[Here's a nice post](https://medium.com/@attyuttam/5-gradient-derivative-related-pytorch-functions-8fd0e02f13c6#:~:text=You%20should%20use%20detach(),recorded%20as%20a%20directed%20graph.) on several commonly used PyTorch functions that are related to calculations of gradient/derivative, including:
- detach()
- no_grad()
- clone()
- backward()
- register_hook()

I've extracted a few key points here:

1. detach() returns a new tensor that shares storage with the original tensor but does not require grad. You should use detach() when you want to remove a tensor from a computation graph. Note that detach() does not make a copy of the data, so an in-place change to one tensor is visible in the other.

2. torch.no_grad() is a context manager that disables gradient calculation.
Disabling gradient calculation is useful for inference, when you are sure that you will not call Tensor.backward(). It will reduce memory consumption for computations that would otherwise have requires_grad=True.
In this mode, the result of every computation will have requires_grad=False, even when the inputs have requires_grad=True.

3. tensor.clone() creates a copy of the tensor that imitates the original tensor’s requires_grad field. We should use clone() as a way to copy the tensor while still keeping the copy as a part of the computation graph it came from. Gradients propagating to the cloned tensor will propagate to the original tensor.
tensor.clone() maintains the connection with the computation graph. That means, if you use the new cloned tensor, and derive the loss from the new one, the gradients of that loss can be computed all the way back even beyond the point where the new tensor was created.
If you want to copy a tensor and detach from the computation graph you should be using **tensor.clone().detach()**

4. tensor.backward() computes the gradient of current tensor w.r.t. graph leaves.

5. tensor.register_hook() can be very useful for inspecting a gradient or replacing the normal gradient calculation with a customised one, but it should be used with caution and only if you are experienced in gradient/derivative calculation.
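
The five functions listed above can be tried on a small tensor. This is a minimal sketch with made-up values; the hook that halves the gradient is purely illustrative.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# 1. detach(): same storage as x, but cut off from the graph
d = x.detach()
print(d.requires_grad)               # False
print(d.data_ptr() == x.data_ptr())  # True: no copy was made

# 2. no_grad(): results do not require grad even though x does
with torch.no_grad():
    y = x * 2
print(y.requires_grad)               # False

# 3. clone() and 4. backward(): the copy stays in the graph,
#    so gradients flow through it back to the leaf x
c = x.clone()
(c ** 2).sum().backward()
grad_via_clone = x.grad.clone()
print(grad_via_clone)                # tensor([2., 4., 6.])

# clone().detach(): an independent copy that is outside the graph
s = x.clone().detach()
print(s.requires_grad, s.data_ptr() == x.data_ptr())  # False False

# 5. register_hook(): the returned tensor replaces the gradient of x
x.grad = None                        # clear the gradient accumulated above
handle = x.register_hook(lambda grad: grad * 0.5)
(x ** 2).sum().backward()
print(x.grad)                        # tensor([1., 2., 3.])
handle.remove()                      # remove the hook when it is no longer needed
```

Note that `backward()` accumulates into `x.grad`, which is why the gradient is cleared before the second call. In a training loop this is the job of `optimizer.zero_grad()`.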