# ROB 498-002/599-009 Final Project: Extending PoseCNN

Binhao QIN, Peter MNEV

# Getting Started

## Setup Code
Before getting started, we need to run some boilerplate code to set up our environment, as in previous assignments. You'll need to rerun this setup code each time you start the notebook.

First, run this cell to load the autoreload extension. This allows us to edit `.py` source files and re-import them into the notebook for a seamless editing and debugging experience.

In [None]:
%load_ext autoreload
%autoreload 2

### Google Colab Setup
Next we need to run a few commands to set up our environment on Google Colab. If you are running this notebook on a local machine you can skip this section.

Run the following cell to mount your Google Drive. Follow the link, sign in to your Google account (the same account you used to store this notebook!) and copy the authorization code into the text box that appears below.

In [None]:
is_colab_env = True

try:
    from google.colab import drive
except ModuleNotFoundError as _:
    is_colab_env = False

if is_colab_env:
    drive.mount("/content/drive")

Now recall the path in your Google Drive where you uploaded this notebook and fill it in below. If everything is working correctly, running the following cell should print the filenames from the assignment:

```
["p4_helper.py", "rob599", "pose_cnn.py", "pose_estimation.ipynb"]
```

In [None]:
import os
import sys

# TODO: Fill in the Google Drive path where you uploaded the assignment
# Example: If you create a 2023WN folder and put all the files under P4 folder, then "2023WN/P4"
# GOOGLE_DRIVE_PATH_AFTER_MYDRIVE = '2023WN/P4'
if is_colab_env:
    GOOGLE_DRIVE_PATH_AFTER_MYDRIVE = 'Colab Notebooks/ROB498-WN23/projects/final'
    GOOGLE_DRIVE_PATH = os.path.join("drive", "My Drive", GOOGLE_DRIVE_PATH_AFTER_MYDRIVE)
else:
    GOOGLE_DRIVE_PATH = os.path.curdir
print(os.listdir(GOOGLE_DRIVE_PATH))


# Add to sys so we can import .py files.
sys.path.append(GOOGLE_DRIVE_PATH)

Next, we install a few packages to help with processing and visualizing object models.

In [None]:
import os
os.environ['PYOPENGL_PLATFORM'] = 'egl'

%pip install trimesh
%pip install pyrender
%pip install pyquaternion

Once you have successfully mounted your Google Drive and located the path to this assignment, run the following cell to allow us to import from the `.py` files of this project. If it works correctly, it should print the last edit time for the file `pose_cnn.py`.

In [None]:
import os
import time

os.environ["TZ"] = "US/Eastern"
time.tzset()

pose_cnn_path = os.path.join(GOOGLE_DRIVE_PATH, "pose_cnn.py")
pose_cnn_edit_time = time.ctime(
    os.path.getmtime(pose_cnn_path)
)
print("pose_cnn.py last edited on %s" % pose_cnn_edit_time)

Load several useful packages that are used in this notebook:

In [None]:
import os
import time

import matplotlib.pyplot as plt
import torch
import torchvision

%matplotlib inline

from utils import *
from rob599 import reset_seed
from rob599.grad import rel_error

# for plotting
plt.rcParams["figure.figsize"] = (10.0, 8.0)  # set default size of plots
plt.rcParams["font.size"] = 16
plt.rcParams["image.interpolation"] = "nearest"
plt.rcParams["image.cmap"] = "gray"

We will use GPUs to accelerate our computation in this notebook. Run the following to make sure GPUs are enabled:

In [None]:
if torch.cuda.is_available():
    print("Good to go!")
    DEVICE = torch.device("cuda")
else:
    print("Please set GPU via Edit -> Notebook Settings.")
    DEVICE = torch.device("cpu")

## Load PROPS Pose Dataset
During the majority of our homework assignments so far, we have used the PROPS Classification or Detection datasets for image processing tasks.

In order to train and evaluate object pose estimation models, we need a dataset where each image is annotated with a *set* of *pose labels*, where each pose label gives the 3DoF position and 3DoF orientation of some object in the image.

We will use the [PROPS Pose](https://deeprob.org/datasets/props-pose/) dataset, which provides annotations of this form.
The PROPS Pose dataset is much smaller than typical pose estimation benchmark datasets, and thus easier to manage in a homework assignment.
PROPS comprises annotated bounding boxes for 10 object classes:
`["master_chef_can", "cracker_box", "sugar_box", "tomato_soup_can", "mustard_bottle", "tuna_fish_can", "gelatin_box", "potted_meat_can", "mug", "large_marker"]`.
The choice of these objects is inspired by the [YCB Object and Model Set](https://ieeexplore.ieee.org/document/7251504) commonly used in robotic perception models.

We create a [`PyTorch Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) class
named `PROPSPoseDataset` in `rob599/PROPSPoseDataset.py` that will download the PROPS Pose dataset.

Run the following two cells to set a few config parameters and then download the train/val sets for the PROPS Pose dataset.

In [None]:
import multiprocessing

# Set a few constants related to data loading.
NUM_CLASSES = 10
BATCH_SIZE = 4
NUM_WORKERS = multiprocessing.cpu_count()

In [None]:
from rob599 import PROPSPoseDataset
 
# NOTE: Set `download=True` for the first time when you set up Google Drive folder.
# Turn it back to `False` later for faster execution in the future.
# If this hangs, download and place data in your drive manually.
train_dataset = PROPSPoseDataset(
    GOOGLE_DRIVE_PATH, "train",
    download=False  # True (for the first time)
) 
val_dataset = PROPSPoseDataset(GOOGLE_DRIVE_PATH, "val")

print(f"Dataset sizes: train ({len(train_dataset)}), val ({len(val_dataset)})")

Each sample from this dataset is a dictionary containing the following keys:

 - 'rgb': a numpy float32 array of shape (3, 480, 640) scaled to range [0,1]
 - 'depth': a numpy int32 array of shape (1, 480, 640) in (mm)
 - 'objs_id': a numpy uint8 array of shape (10,) containing integer ids for visible objects (1-10) and invisible objects (0)
 - 'label': a numpy bool array of shape (11, 480, 640) containing instance segmentation for objects in the scene
 - 'bbx': a numpy float64 array of shape (10, 4) containing (x, y, w, h) coordinates of object bounding boxes
 - 'RTs': a numpy float64 array of shape (10, 3, 4) containing homogeneous transformation matrices per object into camera coordinate frame
 - 'centermaps': a numpy float64 array of shape (30, 480, 640) containing (dx, dy, z) coordinates to each object's centroid 
 - 'centers': a numpy float64 array of shape (10, 2) containing (x, y) coordinates of object centroids projected to image plane 
 
This dataset assumes that the upper left of the image is the origin point (0, 0).
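
As a rough illustration of how the `RTs` and `centers` entries relate, the sketch below projects the origin of an object frame into pixel coordinates with a pinhole camera model. The intrinsic matrix `K`, the pose values, and the helper name `project_center` are made up for this example and are not taken from PROPS; in the notebook the intrinsics would come from `train_dataset.cam_intrinsic`. It also assumes that the object's origin coincides with its centroid.

```python
import numpy as np

def project_center(RT, K):
    """Project the origin of an object frame into the image.

    RT: (3, 4) array [R | t] mapping object coordinates to the camera frame.
    K:  (3, 3) pinhole intrinsic matrix.
    Returns (x, y) in pixels, with (0, 0) at the upper-left image corner.
    """
    t = RT[:, 3]                 # object origin expressed in the camera frame
    uvw = K @ t                  # homogeneous pixel coordinates
    return uvw[:2] / uvw[2]      # perspective divide

# Hypothetical intrinsics and pose, chosen only for illustration.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
RT = np.hstack([np.eye(3), np.array([[0.1], [-0.05], [1.0]])])

center = project_center(RT, K)
print(center)
```

With these values the object sits slightly right of and above the optical axis, so the projected center lands right of and above the principal point (320, 240).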

### Visualize Dataset

Now let's visualize a few samples from our validation set to make sure the images and labels are loaded correctly. In this next cell, we'll use the `visualize_dataset` function from `rob599/utils.py` to view the RGB observation and pose labels for three random samples.

In the below figure, the final column plots the centermaps for class 0, which corresponds to the master chef coffee can. This plot is included to give a sense of how the centermaps represent gradients towards the object's centroid.
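
To give a feel for what such a map contains, the sketch below builds a (dx, dy, z) map for a single object from a known center. It follows the PoseCNN paper's convention of storing the unit direction from each pixel toward the center; whether `PROPSPoseDataset` normalizes in exactly this way is an assumption, and the function name `make_centermap` is hypothetical.

```python
import numpy as np

def make_centermap(mask, center, depth_z):
    """Build a (3, H, W) map of (dx, dy, z) for one object.

    mask:    (H, W) bool array, True on the object's pixels.
    center:  (cx, cy) object center in pixels.
    depth_z: distance of the object center from the camera.
    """
    H, W = mask.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    dx = center[0] - xs              # horizontal offset from pixel to center
    dy = center[1] - ys              # vertical offset from pixel to center
    norm = np.sqrt(dx**2 + dy**2)
    norm[norm == 0] = 1.0            # avoid dividing by zero at the center pixel
    cmap = np.stack([dx / norm, dy / norm, np.full((H, W), depth_z)])
    return cmap * mask               # zero everywhere outside the object

# Toy example: a 4x6 image with a small rectangular object.
mask = np.zeros((4, 6), dtype=bool)
mask[1:3, 1:5] = True
cmap = make_centermap(mask, center=(3, 2), depth_z=0.8)
print(cmap.shape)  # (3, 4, 6)
```

Stacking one such map per class would give the 30 channels (3 per object class) listed for `'centermaps'` above.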

In [None]:
from rob599 import reset_seed, visualize_dataset

reset_seed(0)

grid_vis = visualize_dataset(val_dataset, alpha=0.25)
plt.axis('off')
plt.imshow(grid_vis)
plt.show()

## Extending PoseCNN

Now that we have our dataset loaded and ready to use, we'll begin implementing a variant of the [PoseCNN](https://arxiv.org/abs/1711.00199) network. This architecture is designed to take an RGB color image as input and produce a [6 degrees-of-freedom pose](https://en.wikipedia.org/wiki/Six_degrees_of_freedom) estimate for each instance of an object within the scene from which the image was taken. To do this, PoseCNN uses 5 operations within the architecture. First, a backbone convolutional feature extraction network is used to produce a tensor representing learned features from the input image. Second, the extracted features are processed by an embedding branch to reduce the spatial resolution and memory overhead for downstream layers. Third, an instance segmentation branch uses the embedded features to identify regions in the image corresponding to each object instance (regions of interest). Fourth, the translations for each object instance are estimated using a translation branch along with the embedded features. Finally, a rotation branch uses the embedded features to estimate a rotation, in the form of a [quaternion](https://en.wikipedia.org/wiki/Quaternions_and_spatial_rotation), for each region of interest.

The architecture is shown in more detail in Figure 2 of the [PoseCNN paper](https://arxiv.org/abs/1711.00199):

![architecture](https://deeprob.org/assets/images/posecnn_arch.png)
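
Since the rotation branch predicts a quaternion while the pose labels are stored as rotation matrices, it can help to see the conversion between the two. The sketch below assumes a `(w, x, y, z)` component order (the order used by `pyquaternion`); the order used by your network output may differ, and the helper name `quat_to_rotmat` is hypothetical.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    # Normalize first so that any non-zero 4-vector maps to a valid rotation.
    w, x, y, z = np.asarray(q, dtype=np.float64) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

# A 90 degree rotation about the z axis maps the x axis onto the y axis.
q = [np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)]
R = quat_to_rotmat(q)
print(np.round(R, 3))
```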

We will now implement, in PyTorch, a variant of this architecture that uses ResNet50 with FPN as the backbone feature extractor and individual branches that operate on the FPN features, using data from our `PROPSPoseDataset`. The remaining functionality for this project will be implemented in the `pose_cnn.py` file.

## Implementing Segmentation Branch

Now that we have our feature extractor set up, we'll implement the instance segmentation branch. This branch should fuse information from the feature extractor according to the architecture diagram of PoseCNN. Specifically, the network will pass output from the feature extractor through a 1x1 convolution+ReLU layer followed by interpolation and an element-wise addition. Next, the intermediate feature is interpolated back to the input image size, followed by a final 1x1 convolution+ReLU layer to predict a probability for each class or background at each pixel.
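
The data flow described above can be sketched as a small PyTorch module. This is only an illustration under stated assumptions, not the reference solution: the class name `SegmentationBranchSketch`, the choice of exactly two input feature maps, the channel sizes, bilinear interpolation, and the final softmax used to turn scores into probabilities are all choices made for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentationBranchSketch(nn.Module):
    """Fuse two feature maps and predict per-pixel class probabilities."""

    def __init__(self, in_channels=256, hidden_dim=64, num_classes=10):
        super().__init__()
        self.reduce_fine = nn.Conv2d(in_channels, hidden_dim, kernel_size=1)
        self.reduce_coarse = nn.Conv2d(in_channels, hidden_dim, kernel_size=1)
        # One output channel per object class plus one for background.
        self.classifier = nn.Conv2d(hidden_dim, num_classes + 1, kernel_size=1)

    def forward(self, feat_fine, feat_coarse, image_size):
        # 1x1 convolution + ReLU on each feature map.
        fine = F.relu(self.reduce_fine(feat_fine))
        coarse = F.relu(self.reduce_coarse(feat_coarse))
        # Upsample the coarse map to the fine resolution, then add element-wise.
        coarse = F.interpolate(coarse, size=fine.shape[-2:],
                               mode="bilinear", align_corners=False)
        fused = fine + coarse
        # Interpolate back to the input image size before the final 1x1 conv + ReLU.
        fused = F.interpolate(fused, size=image_size,
                              mode="bilinear", align_corners=False)
        scores = F.relu(self.classifier(fused))
        # Softmax over the class dimension (an assumption of this sketch).
        return F.softmax(scores, dim=1)

# Tiny shapes so the example runs quickly on a CPU.
branch = SegmentationBranchSketch(in_channels=8, hidden_dim=4, num_classes=10)
feat_fine = torch.randn(2, 8, 6, 8)
feat_coarse = torch.randn(2, 8, 3, 4)
probs = branch(feat_fine, feat_coarse, image_size=(48, 64))
print(probs.shape)  # torch.Size([2, 11, 48, 64])
```

In the actual model the two inputs would be feature maps from the ResNet50 with FPN backbone, and `image_size` would be the 480x640 input resolution.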

### Training PoseCNN to Perform Instance Segmentation

Once you've added code to initialize and perform the forward pass of PoseCNN for feature extraction and instance segmentation, we can attempt to train this part of PoseCNN by itself. The code in the following cell will initialize a PoseCNN model and begin training it on instance segmentation only. You should expect to see your training loss decrease to ~0.1 after training for 2 epochs.

In [None]:
import time

import numpy as np
from torch.utils.data import DataLoader
import torchvision.models as models

from rob599 import reset_seed
from pose_cnn import PoseCNN
from tqdm import tqdm

reset_seed(0)

posecnn_model = PoseCNN(
                       models_pcd = torch.tensor(train_dataset.models_pcd).to(DEVICE, dtype=torch.float32),
                       cam_intrinsic = train_dataset.cam_intrinsic).to(DEVICE)

dataloader = DataLoader(dataset=train_dataset, batch_size=BATCH_SIZE)
optimizer = torch.optim.Adam(posecnn_model.parameters(), lr=0.001,
                                 betas=(0.9, 0.999))

posecnn_model.train()

loss_history = []
log_period = 5
_iter = 0

st_time = time.time()
for epoch in range(3):
    train_loss = []
    for batch in tqdm(dataloader):
        for item in batch:
            batch[item] = batch[item].to(DEVICE)
        loss_dict = posecnn_model(batch)
        optimizer.zero_grad()
        total_loss = loss_dict["loss_segmentation"]
        total_loss.backward()
        optimizer.step()
        train_loss.append(total_loss.item())
    
        if _iter % log_period == 0:
            loss_history.append(total_loss.item())
        _iter += 1
    
    print('Time {0}'.format(time.strftime("%Hh %Mm %Ss", time.gmtime(time.time() - st_time)) + \
                                  ', ' + 'Epoch %02d' % epoch + ', ' + 'Training finished' + f' , with mean training loss {np.array(train_loss).mean()}'))
    
plt.title("Training loss history")
plt.xlabel(f"Iteration (x {log_period})")
plt.ylabel("Loss")
plt.plot(loss_history)
plt.show()

### Inference for Instance Segmentation

Now that we have our segmentation network trained, we can qualitatively evaluate the segmentation results. The following notebook cell will visualize output segmentations on a sample from the validation set.

In [None]:
import random

import numpy as np
from torchvision.utils import make_grid

reset_seed(0)

num_samples = 3
posecnn_model.eval()

plt.text(300, -40, 'RGB', ha="center")
plt.text(950, -40, 'True\nSegmentation', ha="center")
plt.text(1600, -40, 'Predicted\nSegmentation', ha="center")

samples = []
for sample_i in range(num_samples):
    sample_idx = random.randint(0,len(val_dataset)-1)
    sample = val_dataset[sample_idx]
    
    rgb = torch.tensor(sample['rgb'][None, :]).to(DEVICE)
    _, prediction = posecnn_model({'rgb': rgb})
    prediction = prediction.cpu().numpy().astype(np.float64)
    prediction /= prediction.max()
    prediction = (np.tile(prediction, (3, 1, 1)) * 255).astype(np.uint8)
    rgb = (sample['rgb'].transpose(1, 2, 0) * 255).astype(np.uint8)
    depth = ((np.tile(sample['depth'], (3, 1, 1)) / sample['depth'].max()) * 255).astype(np.uint8)
    segmentation = (sample['label']*np.arange(11).reshape((11,1,1))).sum(0,keepdims=True).astype(np.float64)
    segmentation /= segmentation.max()
    segmentation = (np.tile(segmentation, (3, 1, 1)) * 255).astype(np.uint8)
    
    samples.append(torch.tensor(rgb.transpose(2, 0, 1)))
    samples.append(torch.tensor(segmentation))
    samples.append(torch.tensor(prediction))

img = make_grid(samples, nrow=3).permute(1, 2, 0)

plt.axis('off')
plt.imshow(img)
plt.show()

Before moving on, visually inspect the segmentation results above to ensure your forward functions and loss calculations are set up correctly.

## Putting it all together: PoseCNN

We now have all the modules needed to make up our PoseCNN architecture. In the `PoseCNN` class of `pose_cnn.py`, add the translation and rotation branches to the initialization and forward functions. During training, your PoseCNN model should output a `loss_dict` variable with loss values for the segmentation, translation, and rotation branches stored under the keys `"loss_segmentation"`, `"loss_centermap"`, and `"loss_R"`, respectively. The segmentation loss should be calculated using `p4_helper.loss_cross_entropy`, the centroid loss should be calculated using an L1 loss (for example `torch.nn.L1Loss`), and the rotation loss should be calculated using the provided helper `p4_helper.loss_Rotation`. During inference, your model should output a dictionary of predicted poses in `output_dict` (see `PoseCNN.generate_pose` for a formatting utility) and the predicted segmentation map (post-processed probabilities) in `segmentation`.

After this, you will be ready to train your PoseCNN model with all three losses:

In [None]:
import os
import time

import numpy as np
import torch
from torch.utils.data import DataLoader
import torchvision.models as models

import rob599
from pose_cnn import PoseCNN

rob599.reset_seed(0)

dataloader = DataLoader(dataset=train_dataset, batch_size=BATCH_SIZE)

posecnn_model = PoseCNN(
                models_pcd = torch.tensor(train_dataset.models_pcd).to(DEVICE, dtype=torch.float32),
                cam_intrinsic = train_dataset.cam_intrinsic).to(DEVICE)
posecnn_model.train()
    
optimizer = torch.optim.Adam(posecnn_model.parameters(), lr=0.001,
                            betas=(0.9, 0.999))


loss_history = []
log_period = 5
_iter = 0


st_time = time.time()
for epoch in range(10):
    train_loss = []
    dataloader.dataset.dataset_type = 'train'
    for batch in dataloader:
        for item in batch:
            batch[item] = batch[item].to(DEVICE)
        loss_dict = posecnn_model(batch)
        optimizer.zero_grad()
        total_loss = 0
        for loss in loss_dict:
            total_loss += loss_dict[loss]
        total_loss.backward()
        optimizer.step()
        train_loss.append(total_loss.item())
        
        if _iter % log_period == 0:
            loss_str = f"[Iter {_iter}][loss: {total_loss:.3f}]"
            for key, value in loss_dict.items():
                loss_str += f"[{key}: {value:.3f}]"

            print(loss_str)
            loss_history.append(total_loss.item())
        _iter += 1
        
    print('Time {0}'.format(time.strftime("%Hh %Mm %Ss", time.gmtime(time.time() - st_time)) + \
                                  ', ' + 'Epoch %02d' % epoch + ', ' + 'Training finished' + f' , with mean training loss {np.array(train_loss).mean()}'))    

torch.save(posecnn_model.state_dict(), os.path.join(GOOGLE_DRIVE_PATH, "posecnn_model.pth"))
    
plt.title("Training loss history")
plt.xlabel(f"Iteration (x {log_period})")
plt.ylabel("Loss")
plt.plot(loss_history)
plt.show()

### Inference

Visualize a few outputs from the fully trained model. These results could be improved by using a larger model, training for longer, and refining the final estimates with ICP on the depth data.

In [None]:
import torch
import random
from torch.utils.data import DataLoader
import torchvision.models as models

import rob599
from pose_cnn import PoseCNN, eval


rob599.reset_seed(0)

dataloader = DataLoader(dataset=val_dataset, batch_size=BATCH_SIZE)

posecnn_model = PoseCNN(
                models_pcd = torch.tensor(val_dataset.models_pcd).to(DEVICE, dtype=torch.float32),
                cam_intrinsic = val_dataset.cam_intrinsic).to(DEVICE)
posecnn_model.load_state_dict(torch.load(os.path.join(GOOGLE_DRIVE_PATH, "posecnn_model.pth")))

num_samples = 5
for i in range(num_samples):
    out = eval(posecnn_model, dataloader, DEVICE)

    plt.axis('off')
    plt.imshow(out)
    plt.show()

Finally, let's measure the quantitative accuracy of our trained model using the 5°5cm metric. That is, we'll count how many visible objects our model was able to predict correctly, where a correct prediction is defined as one with a rotation error of less than 5° and a translation error of less than 5cm.

The instructor's model, trained with the hyperparameters above, achieves 29.3%.
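
The 5°5cm criterion can also be sketched in isolation with plain NumPy. This version recovers the rotation angle from the trace of the relative rotation matrix rather than from quaternions, and assumes translations are given in meters (matching the factor of 100 used to convert to centimeters in this notebook). The helper names `pose_errors` and `is_correct` are hypothetical.

```python
import numpy as np

def pose_errors(RT_pred, RT_true):
    """Return (rotation error in degrees, translation error in cm).

    Both poses are (3, 4) arrays [R | t] with translations in meters.
    """
    R_rel = RT_pred[:3, :3] @ RT_true[:3, :3].T
    # Rotation angle from the trace; clip guards against rounding outside [-1, 1].
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    rot_err_deg = np.degrees(np.arccos(cos_angle))
    trans_err_cm = 100.0 * np.linalg.norm(RT_pred[:3, 3] - RT_true[:3, 3])
    return rot_err_deg, trans_err_cm

def is_correct(RT_pred, RT_true, rot_thresh_deg=5.0, trans_thresh_cm=5.0):
    """A prediction counts only if both errors are below their thresholds."""
    rot_err, trans_err = pose_errors(RT_pred, RT_true)
    return bool(rot_err < rot_thresh_deg and trans_err < trans_thresh_cm)

def rot_z(deg):
    """Rotation matrix for a rotation of `deg` degrees about the z axis."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0,        0.0,       1.0]])

# Made-up poses: ground truth, a close prediction, and a poorly rotated one.
RT_true = np.hstack([np.eye(3), [[0.0], [0.0], [1.0]]])
RT_good = np.hstack([rot_z(3.0), [[0.02], [0.0], [1.0]]])
RT_bad = np.hstack([rot_z(10.0), [[0.02], [0.0], [1.0]]])

print(is_correct(RT_good, RT_true), is_correct(RT_bad, RT_true))
```

Because the arccosine always returns a value in [0°, 180°], the rotation error here is non-negative by construction.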

In [None]:
import math

import numpy as np
import torch
from torch.utils.data import DataLoader
import torchvision.models as models

import pyquaternion
from tqdm import tqdm

import rob599
from pose_cnn import PoseCNN

rob599.reset_seed(0)

dataloader = DataLoader(dataset=val_dataset, batch_size=BATCH_SIZE)

posecnn_model.load_state_dict(torch.load(os.path.join(GOOGLE_DRIVE_PATH, "posecnn_model.pth")))
posecnn_model.eval()


T_thresh = 5 # cm
R_thresh = 5 # deg

total = 0
correct = 0
for batch in tqdm(dataloader):
    for item in batch:
        batch[item] = batch[item].to(DEVICE)
    pose_dict, segmentation = posecnn_model(batch)
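    # Use the actual batch length: the last batch may hold fewer than BATCH_SIZE samples.
    for bidx in range(batch['objs_id'].shape[0]):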
        objs_visib = batch['objs_id'][bidx].cpu().tolist()
        objs_preds = sorted(list(pose_dict[bidx].keys()))
        for objidx, objs_id in enumerate(objs_visib):
            if objs_id==0:
                continue

            total += 1
            if objs_id not in objs_preds:
                continue
            RT_pred = pose_dict[bidx][objs_id]
            RT_true = batch['RTs'][bidx][objidx].cpu().numpy()

            # Translation error
            T_pred = RT_pred[:3,3]
            T_true = RT_true[:3,3]
            T_err = 100*np.linalg.norm(T_pred-T_true) # error in cm

            # Rotation error
            R_true = pyquaternion.Quaternion(matrix=RT_true[:3,:3],atol=1e-6)
            R_pred = pyquaternion.Quaternion(matrix=RT_pred[:3,:3],atol=1e-6)

            R_rel = R_pred * R_true.conjugate
            # pyquaternion wraps the angle to (-pi, pi), so compare its magnitude
            R_err = abs(math.degrees(R_rel.angle))

            if T_err<T_thresh and R_err<R_thresh:
                correct+=1

print("Accuracy at 5°5cm:",correct/total)

# Save Your Work
After completing this notebook, run the following cell to create a `.zip` file for you to download. 

**Please MANUALLY SAVE all `*.ipynb` and `*.py` files before executing the following cell:**

In [None]:
from rob599.submit import make_p4_submission

# TODO: Replace these with your actual uniquename and umid
uniquename = 'bhqin'
umid = 69209865

make_p4_submission(GOOGLE_DRIVE_PATH, uniquename, umid)