## LoRA module
We also provide a toy example of a LoRA module injected into any `nn.Linear` layer. You will implement similar LoRA classes for `Conv2d` and `ConvTranspose2d`.

For LoRA: https://arxiv.org/abs/2106.09685
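
As a starting point for the `Conv2d` case, here is a minimal sketch, not the provided toy example. The class name `LoRAConv2d`, the `rank` and `alpha` arguments, and the choice to factor the update as a `k x k` convolution followed by a `1 x 1` convolution are all assumptions made for illustration. It also assumes `groups=1` and the default padding mode on the wrapped layer.

```python
import math

import torch
import torch.nn as nn


class LoRAConv2d(nn.Module):
    """Hypothetical LoRA wrapper around a frozen nn.Conv2d (assumes groups=1)."""

    def __init__(self, base: nn.Conv2d, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        # Freeze the pretrained convolution; only the adapters are trained.
        for p in self.base.parameters():
            p.requires_grad_(False)
        # Down-projection: same spatial geometry as the base conv, `rank` output channels.
        self.lora_down = nn.Conv2d(
            base.in_channels, rank,
            kernel_size=base.kernel_size, stride=base.stride,
            padding=base.padding, dilation=base.dilation, bias=False,
        )
        # Up-projection: 1x1 conv back to the base output channels.
        self.lora_up = nn.Conv2d(rank, base.out_channels, kernel_size=1, bias=False)
        nn.init.kaiming_uniform_(self.lora_down.weight, a=math.sqrt(5))
        # Zero init so the wrapped layer initially behaves exactly like the base layer.
        nn.init.zeros_(self.lora_up.weight)
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_up(self.lora_down(x))
```

Because the adapters are registered as submodules, calling `.to(device)` on the parent model moves them together with the frozen weights. A `ConvTranspose2d` variant could follow the same pattern, with the adapter layers chosen so that the output shape matches the base layer.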

## Methods and Results (in your report)

1. How do you implement LoRA, and which parts of the SAM model do you fine-tune?
2. Summarize the number of parameters before and after applying LoRA.
3. Compare the training results of a fully unfrozen SAM vs. LoRA.
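
For item 2, the following back-of-the-envelope sketch shows how the trainable parameter count changes for a single linear layer. The helper names and the layer size of 768 are illustrative assumptions, not values taken from the assignment code.

```python
def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameters of a dense layer: weight matrix plus optional bias."""
    return d_in * d_out + (d_out if bias else 0)


def lora_linear_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters of a rank-r update B @ A, with A: (rank, d_in) and B: (d_out, rank)."""
    return rank * (d_in + d_out)


full = linear_params(768, 768)               # 590592
lora = lora_linear_params(768, 768, rank=4)  # 6144
print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.4f}")
```

For the whole model, the usual PyTorch idiom is `sum(p.numel() for p in model.parameters() if p.requires_grad)`, evaluated once before and once after freezing the base weights.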

## SAM target assignment

Take a closer look at the data: some big masks contain smaller masks. For a single prompt, the SAM model predicts 3 output masks (each with a confidence score), ordered from large to small objects. You will need to implement functions that sample these 3 target masks from the masks of one image in the SA-1B dataset.

In [25]:
def point_sample(all_masks, points_coords, points_label):
    # all_masks: [N, H, W], one image, N masks
    # points_coords: (P, 2), P prompt points given as (x, y) pixel coordinates
    # points_label: (P, 1), 1 for foreground, 0 for background
    # return: sampled_masks: [3, H, W], masks order from big to small
    # you can modify the signature of this function

    mask_ids = []
    for i, mask in enumerate(all_masks):
        is_valid = True
        for is_fore, (x, y) in zip(points_label, points_coords):
            on_mask = mask[y][x]
            is_valid = (on_mask and is_fore) or (not on_mask and not is_fore)
            if not is_valid:
                break

        if is_valid:
            mask_ids.append(i)

    #if len(mask_ids) == 
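    # Sort the matching masks by area, largest first, to get the documented big-to-small order.
    mask_ids.sort(key=lambda i: int(all_masks[i].sum()), reverse=True)
    return all_masks[mask_ids[:3]]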



def box_sample(all_masks, bbox):
    # all_masks: [N, H, W], one image, N masks
    # bbox: (xxyy)
    # return: sampled_masks: [3, H, W], masks order from big to small
    # you can modify the signature of this function
    ...

# you don't need to implement the case with both points and box
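
One possible way to fill in `box_sample()` is sketched below with NumPy. The assignment does not fix the selection rule, so this sketch makes two assumptions: `bbox` is read as `(x_min, x_max, y_min, y_max)` with inclusive pixel indices, and a mask matches the box when it is non-empty and lies entirely inside it. The name `box_sample_sketch` and the parameter `k` are hypothetical.

```python
import numpy as np


def box_sample_sketch(all_masks: np.ndarray, bbox, k: int = 3) -> np.ndarray:
    """Return up to k masks lying fully inside bbox, ordered from big to small.

    all_masks: [N, H, W] binary masks of one image.
    bbox: (x_min, x_max, y_min, y_max), inclusive (assumed layout).
    """
    x0, x1, y0, y1 = bbox
    masks = all_masks.astype(bool)
    inside = np.zeros(masks.shape[1:], dtype=bool)
    inside[y0:y1 + 1, x0:x1 + 1] = True

    flat = masks.reshape(len(masks), -1)
    areas = flat.sum(axis=1)
    # A mask is rejected if any of its pixels falls outside the box.
    leaks = (flat & ~inside.reshape(-1)).any(axis=1)

    order = np.argsort(-areas, kind="stable")  # largest area first
    ids = [int(i) for i in order if areas[i] > 0 and not leaks[i]]
    return all_masks[np.array(ids[:k], dtype=int)]
```

Other rules are equally defensible, for example ranking masks by the IoU between their tight bounding box and the prompt box. Whichever rule you choose, state it in the report.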

## Visualize (in your report)

Show the 3 returned masks, ordered from big to small.

Show `point_sample()` with: 1. one positive point; 2. one positive and one negative point; 3. multiple points, both positive and negative.

Show `box_sample()` with: 1. one positive box.

## Training
As described in Section 3 and Appendix A of the SAM paper, you simulate an interactive segmentation setup during training. You need to implement: 1a. single-point prompt training; 1b. iterative training for up to 3 iterations; 2. box prompt training, with only 1 iteration.

### 1a and 2: one iteration training
First, with equal probability either a foreground point
or bounding box is selected randomly for the target mask.
Points are sampled uniformly from the ground truth mask.
Boxes are taken as the ground truth mask’s bounding box,
with random noise added in each coordinate with standard
deviation equal to 10% of the box sidelength, to a maximum of 20 pixels.
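
The first-prompt sampling described above can be sketched with NumPy as follows. The function names are hypothetical, boxes use an `(x_min, x_max, y_min, y_max)` layout as an assumption, and "to a maximum of 20 pixels" is interpreted here as clipping each coordinate offset to ±20 pixels. Capping the standard deviation at 20 pixels instead is another plausible reading.

```python
import numpy as np


def sample_foreground_point(gt_mask: np.ndarray, rng: np.random.Generator):
    """Pick one (x, y) pixel uniformly from the ground-truth mask."""
    ys, xs = np.nonzero(gt_mask)
    j = rng.integers(len(ys))
    return int(xs[j]), int(ys[j])


def sample_noisy_box(gt_mask: np.ndarray, rng: np.random.Generator, max_noise: float = 20.0):
    """Tight box of the mask plus Gaussian jitter on each coordinate."""
    ys, xs = np.nonzero(gt_mask)
    tight = np.array([xs.min(), xs.max(), ys.min(), ys.max()], dtype=float)
    width = tight[1] - tight[0] + 1
    height = tight[3] - tight[2] + 1
    # Standard deviation is 10% of the side length along each axis.
    std = 0.1 * np.array([width, width, height, height])
    noise = np.clip(rng.normal(0.0, std), -max_noise, max_noise)
    h, w = gt_mask.shape
    upper = np.array([w - 1, w - 1, h - 1, h - 1], dtype=float)
    return np.clip(tight + noise, 0.0, upper)  # keep the box inside the image


def sample_first_prompt(gt_mask: np.ndarray, rng: np.random.Generator):
    """Choose a point prompt or a box prompt with equal probability."""
    if rng.random() < 0.5:
        return "point", sample_foreground_point(gt_mask, rng)
    return "box", sample_noisy_box(gt_mask, rng)
```

Passing an explicit `np.random.Generator` keeps the sampling reproducible, which helps when producing the figures for the report.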

### 1b: three iteration training
After making a prediction from this first prompt, subsequent points are selected uniformly from the error region
between the previous mask prediction and the ground truth
mask. Each new point is foreground or background if the error region is a false negative or false positive, respectively.
We also supply the mask prediction from the previous iteration as an additional prompt to our model. To provide
the next iteration with maximal information, we supply the
unthresholded mask logits instead of the binarized mask.
When multiple masks are returned, the mask passed to the
next iteration and used to sample the next point is the one
with the highest predicted IoU.
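
A minimal NumPy sketch of one correction step follows. The function names are hypothetical, and the logit threshold of 0.0 is an assumption that should be matched to the threshold your model uses to binarize masks.

```python
import numpy as np


def select_best_mask(mask_logits: np.ndarray, iou_preds: np.ndarray) -> np.ndarray:
    """From [3, H, W] logits, keep the mask with the highest predicted IoU."""
    return mask_logits[int(np.argmax(iou_preds))]


def sample_correction_point(pred_logits, gt_mask, rng, threshold: float = 0.0):
    """Sample the next point uniformly from the error region.

    Returns ((x, y), label) with label 1 for a false negative (foreground point)
    and 0 for a false positive (background point), or None if there is no error.
    """
    pred = pred_logits > threshold
    gt = gt_mask.astype(bool)
    error = pred ^ gt  # pixels where prediction and ground truth disagree
    ys, xs = np.nonzero(error)
    if len(ys) == 0:
        return None
    j = rng.integers(len(ys))
    x, y = int(xs[j]), int(ys[j])
    # Inside the error region, a ground-truth pixel is a false negative.
    label = int(gt[y, x])
    return (x, y), label
```

The unthresholded output of `select_best_mask()` is what would be fed back as the mask prompt for the next iteration, alongside the accumulated points.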

- You do not need to implement text prompts, as there is no text data in the SA-1B dataset.

In [1]:
import lora_sam
import pytorch_lightning as pl

model = lora_sam.LoRASAM(4, 1)
train_loader, val_loader = lora_sam.get_loaders()
trainer = pl.Trainer(devices=1, accelerator="gpu", max_epochs=30)
trainer.fit(model, train_loader, val_loader)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA GeForce RTX 4050 Laptop GPU') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name | Type | Params
------------------------------
0 | sam  | Sam  | 90.8 M
------------------------------
90.8 M    Trainable params
0         Non-trainable params
90.8 M    Total params
363.145   Total estimated model params size (MB)


Sanity Checking DataLoader 0:  50%|█████     | 1/2 [00:00<00:00, 86.30it/s]
Epoch 0:   0%|          | 0/8948 [00:00<?, ?it/s]

torch.Size([1, 3, 160, 256]) torch.Size([1, 82, 160, 256])
torch.Size([1, 3, 160, 256]) torch.Size([1, 82, 160, 256])


RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

## Results (in your report)

1. Show the process of sampling points in multi-iteration training, and how the GT mask is assigned in each iteration.
1. Show the training and validation loss curves, for each loss and for the sum of all losses.
1. Besides the loss, think about metrics such as mIoU and mAP: how would you implement them in SAM's setting, or why would you not implement them? What makes them unsuitable for SAM's task? Remember: 1) we only have masks without categories; 2) SAM needs prompts.
1. On your trained model, cherry-pick good examples, but also pick bad examples. Rescale the input images back to 1024x1024 pixels, pass them to the original SAM model with the original pipeline, and compare against your low-resolution LoRA results. Discuss what makes them good or bad.
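
For the metrics question, one class-agnostic option is a per-prompt IoU between the predicted mask and its assigned ground-truth mask, averaged over all prompts. The sketch below assumes binary NumPy masks, and its function names are hypothetical. It treats two empty masks as a perfect match, which is a convention rather than a requirement.

```python
import numpy as np


def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over union of two binary masks of the same shape."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: counted as a perfect match (convention)
    return float(np.logical_and(pred, gt).sum() / union)


def mean_iou(preds, gts) -> float:
    """Average IoU over (prediction, target) pairs, one pair per prompt."""
    return float(np.mean([mask_iou(p, g) for p, g in zip(preds, gts)]))
```

Unlike semantic-segmentation mIoU, this average runs over prompts rather than over classes, so the reported value depends on how prompts are sampled.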