### Image Segmentation dataset preparation

### FIXME

1. Assign a `model_name` in FIXME 1
1. Choose between the default and a custom dataset in FIXME 2
1. Assign the path of `DATA_DIR` in FIXME 3
1. Assign your cloud credentials in FIXME 4
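
The first two settings can be sanity-checked before any cell runs. The following is a minimal sketch, assuming a hypothetical helper named `check_settings` that is not part of this notebook; the accepted values are copied from the model list and the `default`/`custom` choice used in the cells below.

```python
# Values taken from the FIXME 1 model list and the FIXME 2 choices in this notebook.
SUPPORTED_MODELS = ("segformer", "mask2former", "mask_grounding_dino")
DATASET_CHOICES = ("default", "custom")


def check_settings(model_name, dataset_to_be_used):
    """Hypothetical helper: raise ValueError if a setting is not recognised."""
    if model_name not in SUPPORTED_MODELS:
        raise ValueError(
            f"model_name must be one of {SUPPORTED_MODELS}, got {model_name!r}"
        )
    if dataset_to_be_used not in DATASET_CHOICES:
        raise ValueError(
            f"dataset_to_be_used must be one of {DATASET_CHOICES}, "
            f"got {dataset_to_be_used!r}"
        )
    return True


check_settings("mask_grounding_dino", "default")
```

A check like this catches a misspelled model name early; without it, an unrecognised name would likely fall through to the `else` branch of the download cell, which expects the ISBI `.tif` files.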

In [None]:
# Define model_name and other variables
# Available models (#FIXME 1):
# 1. segformer - https://docs.nvidia.com/tao/tao-toolkit/text/semantic_segmentation/segformer.html
# 2. mask2former - https://docs.nvidia.com/tao/tao-toolkit/text/cv_finetuning/pytorch/instance_segmentation/mask2former.html
# 3. mask_grounding_dino - https://docs.nvidia.com/tao/tao-toolkit/text/cv_finetuning/pytorch/instance_segmentation/mask_grounding_dino.html

model_name = "mask_grounding_dino"  # FIXME1: set to one of the model names listed above

### Example dataset source and structure <a class="anchor" id="head-1.1"></a>

**Semantic Segmentation:**
We will use the `ISBI Challenge: Segmentation of neuronal structures in EM stacks dataset` for the binary segmentation tutorial (Unet and Segformer). This notebook does not download it for you: get the data, which is in .tif format, from the open source repo [here](https://github.com/alexklibisz/isbi-2012/tree/master/data), then copy the `train-labels.tif`, `train-volume.tif`, and `test-volume.tif` files to `DATA_DIR`.

**Instance Segmentation:**
We will use the `COCO dataset` (a subset in this notebook) for the `Mask Grounding Dino` and `Mask2former` models. The download cell below fetches the COCO dataset automatically.

**If you use a custom dataset, it should follow this directory structure:**
```
DATA_DIR
├── images
│   ├── test
│   │   ├── image_0.png
│   │   ├── image_1.png
│   │   └── ...
│   ├── train
│   │   ├── image_2.png
│   │   ├── image_3.png
│   │   └── ...
│   └── val
│       ├── image_4.png
│       ├── image_5.png
│       └── ...
└── masks
    ├── train
    │   ├── image_2.png
    │   ├── image_3.png
    │   └── ...
    └── val
        ├── image_4.png
        ├── image_5.png
        └── ...
```
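Filenames must match between images and masks: each mask under `masks/train` or `masks/val` has the same filename as its image in the corresponding `images` folder.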
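
The filename requirement can be checked before packaging a custom dataset. This is a minimal sketch, assuming a hypothetical helper named `unmatched_files` (not part of this notebook) and the `images/<split>` and `masks/<split>` layout shown above. The demo runs on a throwaway directory, not on `DATA_DIR`.

```python
import os
import tempfile


def unmatched_files(data_dir, split):
    """Return (images without a mask, masks without an image) for one split."""
    images = set(os.listdir(os.path.join(data_dir, "images", split)))
    masks = set(os.listdir(os.path.join(data_dir, "masks", split)))
    return sorted(images - masks), sorted(masks - images)


# Demo: two images but only one mask in the train split.
with tempfile.TemporaryDirectory() as demo_dir:
    layout = (("images", ["image_2.png", "image_3.png"]),
              ("masks", ["image_2.png"]))
    for kind, names in layout:
        folder = os.path.join(demo_dir, kind, "train")
        os.makedirs(folder)
        for name in names:
            open(os.path.join(folder, name), "w").close()

    missing_masks, missing_images = unmatched_files(demo_dir, "train")
    print(missing_masks)   # ['image_3.png']
    print(missing_images)  # []
```

Only the `train` and `val` splits have masks in the structure above, so the check applies to those two splits.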

In [None]:
import os

dataset_to_be_used = "default"  # FIXME2: "default" for the dataset used in this tutorial notebook, "custom" for a different dataset
DATA_DIR = f'/data/{model_name}'  # FIXME3
os.environ['DATA_DIR'] = DATA_DIR
!mkdir -p $DATA_DIR

### Dataset download and pre-processing <a class="anchor" id="head-1"></a>

In [None]:
# Download the default dataset if needed, then verify that the expected files exist
if dataset_to_be_used == "default":
    if model_name == "mask_grounding_dino":
        if not os.path.exists(f"{DATA_DIR}/raw-data"):
            !bash coco/download_coco.sh $DATA_DIR
        assert (os.path.exists(f"{DATA_DIR}/raw-data"))

        # extract a subset of images from COCO dataset
        # comment out if you need full dataset
        !python3 mask_grounding_dino/extract_subset.py $DATA_DIR/raw-data/train2017 $DATA_DIR/raw-data/annotations/instances_train2017.json $DATA_DIR/raw-data/train2017_subset 100
        !python3 mask_grounding_dino/extract_subset.py $DATA_DIR/raw-data/val2017 $DATA_DIR/raw-data/annotations/instances_val2017.json $DATA_DIR/raw-data/val2017_subset true 5
        assert (os.path.exists(f"{DATA_DIR}/raw-data/train2017_subset/images"))
        assert (os.path.exists(f"{DATA_DIR}/raw-data/train2017_subset/instances_train2017.json"))
        assert (os.path.exists(f"{DATA_DIR}/raw-data/val2017_subset/images"))
        assert (os.path.exists(f"{DATA_DIR}/raw-data/val2017_subset/instances_val2017.json"))
        assert (os.path.exists(f"{DATA_DIR}/raw-data/val2017_subset/label_map.json"))

        # Convert coco to odvg and contiguous format
        RESULTS_DIR = f"{DATA_DIR}/odvg/annotations"
        !mkdir -p $RESULTS_DIR
        !python3 -m pip install numpy pycocotools tqdm
        from coco.coco_to_odvg import convert_coco_to_odvg
        convert_coco_to_odvg(f"{DATA_DIR}/raw-data/train2017_subset/instances_train2017.json", RESULTS_DIR)
        assert (os.path.exists(f"{RESULTS_DIR}/instances_train2017_odvg.jsonl"))
        assert (os.path.exists(f"{RESULTS_DIR}/instances_train2017_odvg_labelmap.json"))
    elif model_name == "mask2former":
        if not os.path.exists(f"{DATA_DIR}/raw-data"):
            !bash coco_panoptic/download_coco.sh $DATA_DIR
        assert (os.path.exists(f"{DATA_DIR}/raw-data"))

        # extract a subset of images from COCO dataset
        # comment out if you need full dataset
        !python3 coco_panoptic/extract_subset.py $DATA_DIR/raw-data/train2017 $DATA_DIR/raw-data/panoptic_train2017 $DATA_DIR/raw-data/annotations/instances_train2017.json $DATA_DIR/raw-data/annotations/panoptic_train2017.json $DATA_DIR/raw-data/train2017_subset 100
        !python3 coco_panoptic/extract_subset.py $DATA_DIR/raw-data/val2017 $DATA_DIR/raw-data/panoptic_val2017 $DATA_DIR/raw-data/annotations/instances_val2017.json $DATA_DIR/raw-data/annotations/panoptic_val2017.json $DATA_DIR/raw-data/val2017_subset 5
        assert (os.path.exists(f"{DATA_DIR}/raw-data/train2017_subset/images"))
        assert (os.path.exists(f"{DATA_DIR}/raw-data/train2017_subset/masks"))
        assert (os.path.exists(f"{DATA_DIR}/raw-data/train2017_subset/instances_train2017.json"))
        assert (os.path.exists(f"{DATA_DIR}/raw-data/train2017_subset/panoptic_train2017.json"))
        assert (os.path.exists(f"{DATA_DIR}/raw-data/val2017_subset/images"))
        assert (os.path.exists(f"{DATA_DIR}/raw-data/val2017_subset/masks"))
        assert (os.path.exists(f"{DATA_DIR}/raw-data/val2017_subset/instances_val2017.json"))
        assert (os.path.exists(f"{DATA_DIR}/raw-data/val2017_subset/panoptic_val2017.json"))

    else:
        assert (os.path.exists(f"{DATA_DIR}/train-volume.tif"))
        assert (os.path.exists(f"{DATA_DIR}/train-labels.tif"))
        assert (os.path.exists(f"{DATA_DIR}/test-volume.tif"))
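
A bare `assert` stops at the first missing path and does not say which one it was. As an alternative, the sketch below collects every missing path in one pass. It assumes a hypothetical helper named `missing_paths` that is not part of this notebook, and the demo uses a throwaway directory.

```python
import os
import tempfile


def missing_paths(base_dir, relative_paths):
    """Return the entries of relative_paths that do not exist under base_dir."""
    return [p for p in relative_paths
            if not os.path.exists(os.path.join(base_dir, p))]


# Demo: only one of the three expected ISBI files is present.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "train-volume.tif"), "w").close()
    expected = ["train-volume.tif", "train-labels.tif", "test-volume.tif"]
    missing = missing_paths(demo_dir, expected)
    print(missing)  # ['train-labels.tif', 'test-volume.tif']
```

In the notebook, the same call could be made with `DATA_DIR` and the paths checked by the asserts above, followed by a single `assert not missing, missing`.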

In [None]:
if dataset_to_be_used == "default":
    if model_name == "mask_grounding_dino":
        # Organize train dataset
        !mkdir -p {DATA_DIR}/cloud_folders/data/train
        !tar -C {DATA_DIR}/raw-data/train2017_subset -czf \
            {DATA_DIR}/cloud_folders/data/train/images.tar.gz images
        !cp {DATA_DIR}/odvg/annotations/instances_train2017_odvg.jsonl \
            {DATA_DIR}/cloud_folders/data/train/annotations_odvg.jsonl
        !cp {DATA_DIR}/odvg/annotations/instances_train2017_odvg_labelmap.json \
            {DATA_DIR}/cloud_folders/data/train/annotations_odvg_labelmap.json

        # Organize val dataset
        !mkdir -p {DATA_DIR}/cloud_folders/data/val
        !tar -C {DATA_DIR}/raw-data/val2017_subset -czf \
            {DATA_DIR}/cloud_folders/data/val/images.tar.gz images
        !cp {DATA_DIR}/raw-data/val2017_subset/instances_val2017.json \
            {DATA_DIR}/cloud_folders/data/val/annotations.json
        !cp {DATA_DIR}/raw-data/val2017_subset/label_map.json \
            {DATA_DIR}/cloud_folders/data/val/label_map.json

    elif model_name == "mask2former":
        !mkdir -p {DATA_DIR}/cloud_folders/data/train
        !mkdir -p {DATA_DIR}/cloud_folders/data/val

        !tar -C {DATA_DIR}/raw-data/train2017_subset -czf \
            {DATA_DIR}/cloud_folders/data/train/images.tar.gz images
        !tar -C {DATA_DIR}/raw-data/train2017_subset -czf \
            {DATA_DIR}/cloud_folders/data/train/images_panoptic.tar.gz masks
        !cp {DATA_DIR}/raw-data/train2017_subset/instances_train2017.json \
            {DATA_DIR}/cloud_folders/data/train/annotations.json
        !cp {DATA_DIR}/raw-data/train2017_subset/panoptic_train2017.json \
            {DATA_DIR}/cloud_folders/data/train/annotations_panoptic.json
        !cp coco_panoptic/labelmap.json \
            {DATA_DIR}/cloud_folders/data/train/label_map_panoptic.json
        
        !tar -C {DATA_DIR}/raw-data/val2017_subset -czf \
            {DATA_DIR}/cloud_folders/data/val/images.tar.gz images
        !tar -C {DATA_DIR}/raw-data/val2017_subset -czf \
            {DATA_DIR}/cloud_folders/data/val/images_panoptic.tar.gz masks
        !cp {DATA_DIR}/raw-data/val2017_subset/instances_val2017.json \
            {DATA_DIR}/cloud_folders/data/val/annotations.json
        !cp {DATA_DIR}/raw-data/val2017_subset/panoptic_val2017.json \
            {DATA_DIR}/cloud_folders/data/val/annotations_panoptic.json
        !cp coco_panoptic/labelmap.json \
            {DATA_DIR}/cloud_folders/data/val/label_map_panoptic.json

    else:
        !python3 -m pip install Pillow opencv-python numpy
        # create images and masks from the tif files
        !bash unet/prepare_data.sh $DATA_DIR
        assert (os.path.exists(f"{DATA_DIR}/images/train"))
        assert (os.path.exists(f"{DATA_DIR}/images/val"))
        assert (os.path.exists(f"{DATA_DIR}/images/test"))
        assert (os.path.exists(f"{DATA_DIR}/masks/train"))
        assert (os.path.exists(f"{DATA_DIR}/masks/val"))

        !mkdir -p $DATA_DIR/cloud_folders/data/train/images
        !mkdir -p $DATA_DIR/cloud_folders/data/train/masks
        !tar -C {DATA_DIR}/images -czf $DATA_DIR/cloud_folders/data/train/images/train.tar.gz train
        !tar -C {DATA_DIR}/images -czf $DATA_DIR/cloud_folders/data/train/images/val.tar.gz val
        !tar -C {DATA_DIR}/images -czf $DATA_DIR/cloud_folders/data/train/images/test.tar.gz test
        !tar -C {DATA_DIR}/masks -czf $DATA_DIR/cloud_folders/data/train/masks/train.tar.gz train
        !tar -C {DATA_DIR}/masks -czf $DATA_DIR/cloud_folders/data/train/masks/val.tar.gz val

        !mkdir -p $DATA_DIR/cloud_folders/data/val/images
        !mkdir -p $DATA_DIR/cloud_folders/data/val/masks
        !tar -C {DATA_DIR}/images -czf $DATA_DIR/cloud_folders/data/val/images/train.tar.gz train
        !tar -C {DATA_DIR}/images -czf $DATA_DIR/cloud_folders/data/val/images/val.tar.gz val
        !tar -C {DATA_DIR}/images -czf $DATA_DIR/cloud_folders/data/val/images/test.tar.gz test
        !tar -C {DATA_DIR}/masks -czf $DATA_DIR/cloud_folders/data/val/masks/train.tar.gz train
        !tar -C {DATA_DIR}/masks -czf $DATA_DIR/cloud_folders/data/val/masks/val.tar.gz val
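
Each `tar -C <parent> -czf <archive> <name>` call above stores paths relative to `<parent>`, so the archive's top-level entry is `<name>`. The sketch below reproduces that behavior with Python's standard `tarfile` module, for environments where the shell command is unavailable. The helper name `archive_subdir` is hypothetical, and the demo runs in a throwaway directory.

```python
import os
import tarfile
import tempfile


def archive_subdir(parent_dir, name, output_path):
    """Pack parent_dir/name into a gzip tarball with paths relative to parent_dir."""
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    with tarfile.open(output_path, "w:gz") as tar:
        # arcname drops the parent_dir prefix, like `tar -C parent_dir`.
        tar.add(os.path.join(parent_dir, name), arcname=name)


with tempfile.TemporaryDirectory() as demo_dir:
    image_dir = os.path.join(demo_dir, "raw", "images")
    os.makedirs(image_dir)
    open(os.path.join(image_dir, "image_0.png"), "w").close()

    out = os.path.join(demo_dir, "cloud_folders", "data", "train", "images.tar.gz")
    archive_subdir(os.path.join(demo_dir, "raw"), "images", out)

    with tarfile.open(out, "r:gz") as tar:
        members = sorted(tar.getnames())
    print(members)  # ['images', 'images/image_0.png']
```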

### Final step: Upload the data folders to your cloud storage and move on to running the API requests example notebooks
After the upload, listing your bucket should show a `/data` folder containing the two subfolders uploaded by the cell below: `segmentation_{model_name}_train` and `segmentation_{model_name}_val`.
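
To preview what the bucket should contain, the object keys can be derived from the local folder, since a recursive `aws s3 cp` keeps each file's path relative to the source directory. This is a minimal sketch that only walks a local directory and does not contact any cloud service. The helper name `planned_keys` is hypothetical, and the demo uses a throwaway directory with two placeholder files.

```python
import os
import tempfile


def planned_keys(local_dir, key_prefix):
    """List the object keys a recursive copy of local_dir to key_prefix would create."""
    keys = []
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), local_dir)
            # Object keys always use forward slashes.
            keys.append(f"{key_prefix}/{rel.replace(os.sep, '/')}")
    return sorted(keys)


with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("images.tar.gz", "annotations.json"):
        open(os.path.join(demo_dir, name), "w").close()
    keys = planned_keys(demo_dir, "data/segmentation_mask_grounding_dino_val")
    for key in keys:
        print(key)
```

Comparing this list with the output of listing the bucket after the upload is one way to confirm that nothing was skipped.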

In [None]:
!python3 -m pip install --upgrade awscli
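ACCESS_KEY=FIXME4.1  # FIXME4.1: replace with your access key ID, as a string
SECRET_KEY=FIXME4.2  # FIXME4.2: replace with your secret access key, as a string
BUCKET_NAME=FIXME4.3  # FIXME4.3: replace with your bucket name, as a string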

!AWS_ACCESS_KEY_ID={ACCESS_KEY} AWS_SECRET_ACCESS_KEY={SECRET_KEY} aws s3 cp {DATA_DIR}/cloud_folders/data/train s3://{BUCKET_NAME}/data/segmentation_{model_name}_train --recursive
!AWS_ACCESS_KEY_ID={ACCESS_KEY} AWS_SECRET_ACCESS_KEY={SECRET_KEY} aws s3 cp {DATA_DIR}/cloud_folders/data/val s3://{BUCKET_NAME}/data/segmentation_{model_name}_val --recursive

In [None]:
# These are the dataset paths to use in your API/TAO-CLIENT notebooks
train_dataset_path = f"/data/segmentation_{model_name}_train"
eval_dataset_path = f"/data/segmentation_{model_name}_val"