# Mini-scene creation from rescaled CVAT annotations

This notebook provides step-by-step instructions for creating mini-scene videos from CVAT annotations.
If [downgrade.sh](/helper_scripts/downgrade.sh) was used to reduce the video size before uploading it to CVAT, Step 2 upscales the annotations back to the original resolution. Step 3 converts the CVAT annotations to a tracks XML file. Step 4 uses the tracks file to extract a mini-scene video for each animal from the original video.

Inputs:
- original video in mp4 format
- CVAT annotations exported as XML (CVAT for video 1.1), containing tracked bounding boxes around the animals in view
- directory to save the tracks file

Output:
- mini-scenes in mp4 format (video clips centered on a single animal)

## Step 1: Enter script inputs

In [None]:
# path to video mp4
video_path = "replace_me"

# path to CVAT export xml
annotation_path = "replace_me"

# set path to save the tracks file
tracks_location = "replace_me"

# scaling factor for the video
# default from helper_scripts/downgrade.sh is to downscale video to 1/3 of original size
# set to 1 if the annotations were made on the original, full-size video
scaling_factor = 3

In [None]:
# import libraries
import os
from lxml import etree
from collections import OrderedDict

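## Step 2: Upscale annotations to match the original video resolution

This step rescales bounding boxes drawn on a downscaled video back to the original resolution and keeps each track's boxes only up to its last keyframe. Run it even if the video was not downscaled (with `scaling_factor = 1`), because Step 3 builds the tracks file from the `annotated` dictionary created here.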
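
Each CVAT box stores its corners as floating-point pixel coordinates (`xtl`, `ytl`, `xbr`, `ybr`). Upscaling multiplies each coordinate by `scaling_factor` and truncates the result to an integer. A minimal sketch of that arithmetic on a single box; `upscale_box` is a hypothetical helper name, not part of the notebook or of CVAT:

```python
def upscale_box(attrib: dict, scaling_factor: int) -> list:
    """Scale CVAT corner coordinates (given as strings) by scaling_factor."""
    # int() truncates toward zero, matching int(float(...) * scaling_factor) in the notebook
    return [int(float(attrib[key]) * scaling_factor) for key in ("xtl", "ytl", "xbr", "ybr")]


# a box drawn on a video downscaled to 1/3 of its original size
box = {"xtl": "10.5", "ytl": "20.0", "xbr": "30.2", "ybr": "40.0"}
print(upscale_box(box, 3))  # [31, 60, 90, 120]
```

Because of truncation, each scaled coordinate can be up to one pixel smaller than the exact product; using `round()` instead would be an alternative.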

In [None]:
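# parse the CVAT annotation file
root = etree.parse(annotation_path).getroot()
annotated = {}  # track id -> OrderedDict of frame id -> [xtl, ytl, xbr, ybr, keyframe]
track2end = {}  # track id -> frame of the track's last keyframe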

# iterate over all tracks in the annotation file
for track in root.iterfind("track"):
    track_id = int(track.attrib["id"])

    # iterate over all bounding boxes in the track
    for box in track.iter("box"):
        frame_id = int(box.attrib["frame"])
        keyframe = int(box.attrib["keyframe"])

        # record the frame of the track's last keyframe (later keyframes overwrite earlier ones)
        if keyframe == 1:
            track2end[track_id] = frame_id

# iterate over all tracks in the annotation file
for track in root.iterfind("track"):
    track_id = int(track.attrib["id"])

    # iterate over all bounding boxes in the track
    for box in track.iter("box"):
        frame_id = int(box.attrib["frame"])
        keyframe = int(box.attrib["keyframe"])

        # only keep boxes up to and including the track's last keyframe
        if frame_id <= track2end[track_id]:
            if annotated.get(track_id) is None:
                annotated[track_id] = OrderedDict()

            # scale bounding box coordinates and store them with the keyframe flag
            annotated[track_id][frame_id] = [int(float(box.attrib["xtl"]) * scaling_factor),
                                             int(float(box.attrib["ytl"]) * scaling_factor),
                                             int(float(box.attrib["xbr"]) * scaling_factor),
                                             int(float(box.attrib["ybr"]) * scaling_factor),
                                             keyframe]
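
The cell above makes two passes: the first finds each track's last keyframe, the second keeps and scales only the boxes up to that frame. A minimal sketch of the same filtering in one function, run on a tiny hand-written sample (not a real CVAT export). Assumptions: it uses the standard-library `xml.etree.ElementTree` instead of lxml so it runs without extra dependencies, `collect_boxes` is a hypothetical name, and it takes the maximum keyframe number rather than the last one seen, which is the same thing when boxes are listed in frame order:

```python
import xml.etree.ElementTree as ET
from collections import OrderedDict

# hand-written illustration of a CVAT-style track, not a real export
SAMPLE = """<annotations>
  <track id="0" label="Grevy" source="manual">
    <box frame="0" keyframe="1" outside="0" xtl="1" ytl="2" xbr="3" ybr="4"/>
    <box frame="1" keyframe="0" outside="0" xtl="2" ytl="3" xbr="4" ybr="5"/>
    <box frame="2" keyframe="1" outside="0" xtl="3" ytl="4" xbr="5" ybr="6"/>
    <box frame="3" keyframe="0" outside="0" xtl="4" ytl="5" xbr="6" ybr="7"/>
  </track>
</annotations>"""


def collect_boxes(root, scaling_factor):
    """Map track id -> OrderedDict(frame id -> [xtl, ytl, xbr, ybr, keyframe]),
    keeping only boxes up to each track's last keyframe."""
    annotated = {}
    for track in root.iterfind("track"):
        track_id = int(track.attrib["id"])
        boxes = list(track.iter("box"))
        # frame of the last keyframe; -1 means no keyframe, so no boxes are kept
        last_key = max((int(b.attrib["frame"]) for b in boxes
                        if int(b.attrib["keyframe"]) == 1), default=-1)
        frames = OrderedDict()
        for b in boxes:
            frame_id = int(b.attrib["frame"])
            if frame_id <= last_key:
                coords = [int(float(b.attrib[k]) * scaling_factor)
                          for k in ("xtl", "ytl", "xbr", "ybr")]
                frames[frame_id] = coords + [int(b.attrib["keyframe"])]
        if frames:  # like the notebook, skip tracks with no kept boxes
            annotated[track_id] = frames
    return annotated


result = collect_boxes(ET.fromstring(SAMPLE), 3)
print(list(result[0].keys()))  # [0, 1, 2]
```

Frame 3 comes after the last keyframe (frame 2), so it is dropped.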

## Step 3: Create tracks file from CVAT annotations

In [None]:
# create the root element of the new tracks XML
xml_page = etree.Element("annotations")
xml_page.text = "\n"

# add version
xml_version = etree.SubElement(xml_page, "version")
xml_version.text = "1.1"
xml_version.tail = "\n"

# iterate over all tracks and store the bounding boxes
for track_id in annotated.keys():
    xml_track = etree.Element("track", id=str(track_id), label="Grevy", source="manual")
    xml_track.text = "\n"
    xml_track.tail = "\n"

    for frame_id in annotated[track_id].keys():
        if frame_id == sorted(annotated[track_id].keys())[-1]:
            outside = "1"
        else:
            outside = "0"

        xml_box = etree.Element("box", frame=str(frame_id), outside=outside, occluded="0",
                                keyframe=str(annotated[track_id][frame_id][4]),
                                xtl=f"{annotated[track_id][frame_id][0]:.2f}",
                                ytl=f"{annotated[track_id][frame_id][1]:.2f}",
                                xbr=f"{annotated[track_id][frame_id][2]:.2f}",
                                ybr=f"{annotated[track_id][frame_id][3]:.2f}", z_order="0")
        xml_box.tail = "\n"

        xml_track.append(xml_box)

    if len(annotated[track_id].keys()) > 0:
        xml_page.append(xml_track)


# Parse the original XML file
original_tree = etree.parse(annotation_path)
original_root = original_tree.getroot()

# Extract the 'meta' element
meta = original_root.find("meta")

# Scale the frame size stored in 'meta' to the original video resolution
height = int(meta.find("task").find("original_size").find("height").text) * scaling_factor
meta.find("task").find("original_size").find("height").text = str(height)

width = int(meta.find("task").find("original_size").find("width").text) * scaling_factor
meta.find("task").find("original_size").find("width").text = str(width)

# Insert 'meta' into the new XML document
# Note: 'meta' should come before the 'track' elements
first_track = xml_page.find("track")

if first_track is not None:
    xml_page.insert(xml_page.index(first_track), meta)
else:
    # no tracks were written, so 'meta' simply follows 'version'
    xml_page.append(meta)

# Write the new XML document to file (creating the tracks directory if needed)
os.makedirs(tracks_location, exist_ok=True)
etree.indent(xml_page, space="  ", level=0)
xml_document = etree.ElementTree(xml_page)
xml_document.write(os.path.join(tracks_location, "tracks_.xml"), xml_declaration=True,
                   pretty_print=True, encoding="utf-8")
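
The box-writing loop in the cell above can be summarized as a small function. A minimal sketch using the standard-library `xml.etree.ElementTree` instead of lxml; `build_track` is a hypothetical helper, and its `label` default of `"Grevy"` mirrors the label hard-coded in the cell. It takes one track's dictionary in the shape built in Step 2 (`frame id -> [xtl, ytl, xbr, ybr, keyframe]`) and marks only the last frame with `outside="1"`:

```python
import xml.etree.ElementTree as ET


def build_track(track_id, frames, label="Grevy"):
    """Build a <track> element; the last frame is written with outside="1"."""
    track = ET.Element("track", id=str(track_id), label=label, source="manual")
    last_frame = max(frames)  # computed once instead of re-sorting for every box
    for frame_id in sorted(frames):
        xtl, ytl, xbr, ybr, keyframe = frames[frame_id]
        ET.SubElement(
            track, "box",
            frame=str(frame_id),
            outside="1" if frame_id == last_frame else "0",
            occluded="0",
            keyframe=str(keyframe),
            xtl=f"{xtl:.2f}", ytl=f"{ytl:.2f}", xbr=f"{xbr:.2f}", ybr=f"{ybr:.2f}",
            z_order="0",
        )
    return track


example_track = build_track(0, {0: [3, 6, 9, 12, 1], 1: [6, 9, 12, 15, 0], 2: [9, 12, 15, 18, 1]})
print([box.get("outside") for box in example_track])  # ['0', '0', '1']
```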

## Step 4: Create mini-scenes using tracks_extractor

Note: if `tracks_extractor` reports an error, check that the `meta` section was inserted into the tracks XML before the first `track` element.
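
Whether `meta` sits before the first `track` element can be checked programmatically. A minimal sketch, assuming the tracks file has the layout written in Step 3 (a root `annotations` element whose children are `version`, `meta`, then `track` elements); `meta_before_tracks` is a hypothetical helper, and it uses the standard-library `xml.etree.ElementTree` rather than lxml:

```python
import xml.etree.ElementTree as ET


def meta_before_tracks(root):
    """Return True if root has a <meta> child that precedes its first <track> child."""
    tags = [child.tag for child in root]
    if "meta" not in tags:
        return False
    first_track = tags.index("track") if "track" in tags else len(tags)
    return tags.index("meta") < first_track


# In the notebook, check the file written in Step 3, e.g.:
# root = ET.parse(os.path.join(tracks_location, "tracks_.xml")).getroot()
example = ET.fromstring("<annotations><version>1.1</version><meta/><track id='0'/></annotations>")
print(meta_before_tracks(example))  # True
```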

In [None]:
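# extract mini-scenes from the original video using the tracks file
# (an argument list avoids breaking on paths that contain spaces)
import subprocess

subprocess.run(["tracks_extractor", "--video", video_path,
                "--annotation", os.path.join(tracks_location, "tracks_.xml")], check=True)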