# Assignment 02b - (*MP only*) 3D Pedestrian Detection based on Multiple Sensors


## Goals
The goal of this assignment is the same as in 02a, i.e., obtaining **pedestrian locations in the 3D world (camera reference frame)** on the dataset sequence, but this time you should present a solution which aims to take advantage of **multiple sensors**, i.e. any combination of {camera, radar, LiDAR}.


Your approach should work in the following conditions:
- minimum distance of pedestrian to camera: 5 m
- maximum distance of pedestrian to camera: 40 m
- minimum pedestrian height: 1.2 m
- maximum pedestrian height: 1.9 m
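
The operating conditions above can be read as a simple gate on each candidate. The following is a minimal sketch, assuming the distance is measured along the camera's forward axis and that both quantities are in metres; the function name `in_operating_range` is hypothetical and not part of the assignment code.

```python
# Operating conditions from the list above (metres)
MIN_DISTANCE_M, MAX_DISTANCE_M = 5.0, 40.0
MIN_HEIGHT_M, MAX_HEIGHT_M = 1.2, 1.9


def in_operating_range(distance_m: float, height_m: float) -> bool:
    """Return True if a candidate lies within the required distance and height range.

    Hypothetical helper: the bounds are treated as inclusive, which is an assumption.
    """
    distance_ok = MIN_DISTANCE_M <= distance_m <= MAX_DISTANCE_M
    height_ok = MIN_HEIGHT_M <= height_m <= MAX_HEIGHT_M
    return distance_ok and height_ok


print(in_operating_range(12.0, 1.75))  # typical pedestrian at 12 m
print(in_operating_range(55.0, 1.75))  # too far away
```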

## Input
You will work with the custom Sequence of `Dataset` with `start_index=1430` and `end_index=1545`.

Some functions you implemented in notebook `fa_02a_3d_pedestrian_detection_single_camera` will be reused in this notebook, so please make sure you finished notebook `fa_02a_3d_pedestrian_detection_single_camera` beforehand.

## Output
- Plots, visualizations and videos within this notebook
- Answers to the questions within this notebook
- Evaluation metrics and plots representing the performance of your approach

### Q 02b.1 Specification of intended solution
In this notebook, you have all available sensors at your disposal.

Please describe what you will implement to achieve the goal of obtaining 3D pedestrian locations using multiple sensors.
Here are some questions to stimulate the specification of your solution:
1. What processing steps and building blocks are needed?
2. Which concepts from the lectures are you planning to incorporate?
3. Which building blocks from the practica and the assignment notebooks before will you be re-using?
4. Which intermediate results can you represent in plots/visualizations/tables to show the graders that your intended solution does the right thing?

### A 02b.1
**Your answer:** (maximum 400 words)
1. As before, the first step is detection, but this time other sensors can help, especially the lidar. From the lidar point cloud it is possible to identify clusters of points with roughly human dimensions and build a bounding box only around those, instead of trying a large number of bounding boxes in the hope of covering a pedestrian. The 3D analysis also gives the almost exact position and dimensions of each cluster, which can be registered and paired with the corresponding image patches. The classification step is essentially the same as in the previous case. Finally, creating the dictionary is much easier than before, since it is only necessary to add the positive detections together with their previously registered position and dimensions.
2. This time, knowledge about the lidar is useful as well, in addition to everything needed for the previous case (`fa_02a_3d_pedestrian_detection_single_camera`). Once again, knowledge about classification is not strictly necessary because a pretrained classifier is used.
3. Apart from the detection step, the procedure is very similar to the one adopted in `fa_02a_3d_pedestrian_detection_single_camera`, and therefore to parts of `practicum1`. In particular, the `PedestrianDetector` class can be imported directly from `fa_02a_3d_pedestrian_detection_single_camera`, and other functions such as `filter_points`, `get_bbox` and `find_pedestrian` can be copied with minor modifications to adapt them to the new scenario. The function `project_points` and the class `BoundingBox` from `practicum1` will be used once again.
4. Once again, the idea is to show some significant figures, such as k3d plots for the 3D analysis and images with bounding boxes drawn on them for the 2D analysis. Compared to the previous case, there will be more 3D analysis since the lidar point cloud is available as well.
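
The step from a lidar cluster to an image patch described in point 1 can be sketched as follows. This is a simplified illustration in plain Python, assuming a pinhole projection matrix `P` of shape 3x4 and points already expressed in the camera frame (x right, y down, z forward). The matrix values and the names `project_point` and `cluster_to_bbox` are made up for the example; the notebook itself relies on `project_points` and `BoundingBox` from `practicum1` instead.

```python
def project_point(P, xyz):
    """Project a 3D point (camera frame) to pixel coordinates with a 3x4 matrix P."""
    x, y, z = xyz
    u, v, w = (row[0] * x + row[1] * y + row[2] * z + row[3] for row in P)
    return u / w, v / w


def cluster_to_bbox(P, cluster):
    """Return (u_min, v_min, u_max, v_max) enclosing the projections of all cluster points."""
    uvs = [project_point(P, point) for point in cluster]
    us = [u for u, _ in uvs]
    vs = [v for _, v in uvs]
    return min(us), min(vs), max(us), max(vs)


# Hypothetical intrinsics: focal length 1000 px, principal point (968, 608)
P = [
    [1000.0, 0.0, 968.0, 0.0],
    [0.0, 1000.0, 608.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
# Toy cluster: four points of a 0.5 m wide, 1.7 m tall object at 10 m distance
cluster = [(-0.25, 0.0, 10.0), (0.25, 0.0, 10.0), (-0.25, 1.7, 10.0), (0.25, 1.7, 10.0)]
print(cluster_to_bbox(P, cluster))
```

Projecting every cluster point is the most direct variant; projecting only the corners of the cluster's 3D extent, as planned in the answer above, gives a similar box with fewer projections.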

YOUR ANSWER HERE

### Q 02b.2 Proactive reflection
1. Which assumptions does your intended solution make?
2. In which situations might your intended solution fail?
3. In which situations is the multi-sensor solution better than the camera-only solution?

### A 02b.2
**Your answer:** (maximum 400 words)
1. The main assumption is that the lidar and the camera are perfectly synchronized at all times and that all transformations between the two are known and constant. This is fundamental because in every frame the 2D bounding box in the camera image is derived directly from the 3D lidar detection, so the two must agree.
2. The solution cannot be used if there is any delay or malfunction in the communication between camera data and lidar data, for the reasons explained above. Also, if real-time operation is crucial, the amount of processing required might make this solution unviable.
3. Probably in almost every situation, since it should be faster (there are far fewer frame proposals to classify) and more precise in the 3D detection (position and dimensions). However, if the lidar and camera data are not properly aligned, this solution does not work and the camera-only solution has to be used instead.
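
The synchronisation assumption in point 1 could be checked explicitly before fusing the two sensors. Below is a minimal sketch, assuming each measurement carries a timestamp in seconds; the function names, the timestamps and the 50 ms tolerance are all hypothetical choices for illustration and do not come from the dataset API.

```python
def is_synchronized(t_camera_s: float, t_lidar_s: float, max_offset_s: float = 0.05) -> bool:
    """Return True if the camera and lidar timestamps differ by at most max_offset_s."""
    return abs(t_camera_s - t_lidar_s) <= max_offset_s


def choose_detector(t_camera_s: float, t_lidar_s: float) -> str:
    """Fall back to the camera-only approach when the sensors are out of sync."""
    return "multi-sensor" if is_synchronized(t_camera_s, t_lidar_s) else "camera-only"


print(choose_detector(10.00, 10.02))  # small offset
print(choose_detector(10.00, 10.30))  # large offset
```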

YOUR ANSWER HERE

# now: HAVE FUN & HAPPY CODING!

In [None]:
# some magic to ease iterative implementation
from IPython import get_ipython

ipython = get_ipython()
if ipython:
    ipython.run_line_magic("load_ext", "autoreload")
    ipython.run_line_magic("autoreload", "2")

If you run into any errors, please make sure that you don't have variable names in the to-be-imported notebooks which consist of pure capital letters (such as `T`, or `XYZ`, but also `P2`).
See [ipynb docs](https://ipynb.readthedocs.io/en/stable/#import-only-definitions).

## Your own PedestrianDetector
Now it's time to create another subclass of `PedestrianDetector` and detect pedestrians in 3D in the camera frame using **multiple sensors**.
Please provide visualizations of a few intermediate steps in order to obtain partial credit for concepts/implementation and to show the graders that your approach provides the intended functionality.

We recommend to start with a simple approach and iteratively improve it based on the experience you gain along the line.
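
One simple first building block is removing ground points before any further lidar processing. The following is a minimal sketch in plain Python, assuming the ground plane is given as coefficients `(a, b, c, d)` of `a*x + b*y + c*z + d = 0` in the same frame as the points; the function names, the toy values and the 0.3 m threshold are illustrative. Dividing by the norm of the normal makes the threshold a metric distance even if the plane coefficients are not normalized.

```python
import math


def plane_distance(plane, point):
    """Signed distance from a 3D point to the plane a*x + b*y + c*z + d = 0."""
    a, b, c, d = plane
    x, y, z = point
    return (a * x + b * y + c * z + d) / math.sqrt(a * a + b * b + c * c)


def remove_ground(points, plane, epsilon=0.3):
    """Keep only the points farther than epsilon from the ground plane."""
    return [p for p in points if abs(plane_distance(plane, p)) > epsilon]


# Toy example: ground plane y = 1.5 (camera 1.5 m above the ground, y pointing down)
plane = (0.0, 1.0, 0.0, -1.5)
points = [(0.0, 1.5, 10.0), (0.0, 1.4, 10.0), (0.0, 0.5, 10.0)]
print(remove_ground(points, plane))  # only the point 1 m above the ground remains
```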



In [None]:
# Create your subclass of PedestrianDetector here
# Then instantiate it to an object called `pedestrian_detector`
# and feed it with a single measurement of the provided sequence
from ipynb.fs.defs.fa_02a_3d_pedestrian_detection_single_camera import PedestrianDetector
from common.sequence_loader import Dataset

import numpy as np
import k3d
from common.k3d_helpers import plot_axes, plot_box
from ipynb.fs.defs.practicum1 import project_points
from sklearn.cluster import DBSCAN
from common.visualization import colors_qualitative_k3d
from ipynb.fs.defs.practicum1 import BoundingBox
from common.visualization import draw_bbox_to_image
from common.visualization import showimage
from ImagePatch import ImagePatch
from BoundingBox import clip_bbox_to_image
import tensorflow as tf
import os
from preprocessing_fns import preprocessing_fn_mobilenet

from tensorflow.image import non_max_suppression

# try to delete the current instance of pedestrian_detector to avoid running into memory issues
# while programming iteratively within this notebook
# use try-except as pedestrian_detector does not exist during the first call of this cell
try:
    del pedestrian_detector
except NameError:
    pass

is_debug = True

# class MyFancyPedestrianDetector(PedestrianDetector):
#     ...
#
# pedestrian_detector = MyFancyPedestrianDetector(is_debug=is_debug)

# YOUR CODE HERE

class MyFancyPedestrianDetector(PedestrianDetector):
    
    # Copied the init function from the example above
    def __init__(self, is_debug):
        super().__init__(is_debug)
    
        
    # Remove ground points from pc_lidar (similar to segment_ground_points from practicum3, but works on a plain
    # array instead of a PointCloud)
    @staticmethod
    def remove_ground(pc_lidar, ground_plane, epsilon=0.3):
        # Plane residual a*x + b*y + c*z + d for every point, computed in one vectorized step
        distances = pc_lidar[:, :3] @ np.asarray(ground_plane[:3]) + ground_plane[3]
        # Keep only the points farther than epsilon from the plane (on either side);
        # unlike the list-based version, an empty result keeps its (0, 3) shape
        return pc_lidar[np.abs(distances) > epsilon]
    
    
    # Redefine the filter_points function as in fa_02a_3d_pedestrian_detection_single_camera
    @staticmethod
    def filter_points(image, points, projection_matrix):
        # Augment the points to homogeneous coordinates and project them to 2D
        points_aug = np.hstack((points, np.ones((points.shape[0], 1))))
        uvs = project_points(projection_matrix, points_aug)
        # Keep only the points whose 2D projection lands inside the camera image
        height, width = image.shape[:2]
        inside = (uvs[:, 0] >= 0) & (uvs[:, 0] < width) & (uvs[:, 1] >= 0) & (uvs[:, 1] < height)
        return points[inside]
    
    
    # Calculate the object position by projecting the (x, z) coordinates onto the ground plane
    @staticmethod
    def get_position(x, z, ground_plane):
        a, b, c, d = ground_plane
        y = (a * x + c * z + d) / (- b)
        return np.array([x, y, z])
    
    
    # Create a function to generate the clusters from lidar points that will be analyzed
    def get_clusters(self):
    
        # Import data from measurements
        pc_lidar = self.measurements.get_lidar_points()
        ground_plane_cf = self.measurements.get_ground_plane()
        T_cam_lidar = self.measurements.get_T_camera_lidar()
        image = self.measurements.get_camera_image()
        projection_matrix = self.measurements.get_camera_projection_matrix()

        # Get the lidar points and filter them to only those of interest
        pc_lidar_cf = T_cam_lidar.dot(pc_lidar.T).T[:, 0:3]
        # Remove ground
        pc_lidar_cf = MyFancyPedestrianDetector.remove_ground(pc_lidar_cf, ground_plane_cf)
        # Consider only points within 5m and 40m
        pc_lidar_cf = pc_lidar_cf[pc_lidar_cf[:, 2] > 5]
        pc_lidar_cf = pc_lidar_cf[pc_lidar_cf[:, 2] < 40]
        # Consider only points within camera frame
        pc_lidar_cf_filtered = MyFancyPedestrianDetector.filter_points(image, pc_lidar_cf, projection_matrix)
        
        # Plot only the filtered lidar points
        if self.is_debug:
            plot = k3d.plot()
            plot += plot_axes(np.eye(4, dtype=np.float32))
            plot += k3d.points(positions=pc_lidar_cf_filtered.astype(np.float32), point_size=0.1, color=0x0000ff)
#             plot.display()
            self.doa.add_k3d_plot(
                **{
                    "name": "Filtered lidar points",
                    "description": "A plot showing only the lidar points of interest for the analysis: no ground plane, within 5m and 40m and within camera frame.",
                    "plot": plot,
                }
            )

        # Cluster the obtained lidar points in order to identify single structures.
        # DBSCAN was chosen over KMeans because it does not need the number of clusters in advance
        # and labels sparse points as noise instead of forcing them into a cluster.
        clustering = DBSCAN(eps=0.3, min_samples=20).fit(pc_lidar_cf_filtered)
        labels = clustering.labels_
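        # DBSCAN numbers clusters 0..n-1 and marks noise as -1, so count via the largest label
        # (len(np.unique(labels)) - 1 would drop a cluster whenever no noise point is present)
        n_labels = int(labels.max()) + 1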
        clusters = []
        for i in range(n_labels):
            cluster = (pc_lidar_cf_filtered[labels == i])
            clusters.append(cluster)
        
        if self.is_debug:
            colors = colors_qualitative_k3d * 5
            plot_1 = k3d.plot()
            plot_1 += plot_axes(np.eye(4, dtype=np.float32))
            for i in range(n_labels):
                plot_1 += k3d.points(positions=clusters[i].astype(np.float32), point_size=0.1, color=colors[i])
#             plot_1.display()
            self.doa.add_k3d_plot(
                **{
                    "name": "Clustered lidar points",
                    "description": "A plot showing the clustered lidar points with different colours to facilitate cluster identification.",
                    "plot": plot_1,
                }
            )
        
        return clusters
    
    
    # Filter only the significant clusters and calculate their position, dimension and corners
    def filter_clusters(self, epsilon):
        
        # Get all the clusters with the previous function
        clusters = MyFancyPedestrianDetector.get_clusters(self)
        
        # Create lists
        indexes = []
        positions = []
        dimensions = []
        corners_3d = []
        
        # For each cluster find min and max value for all three dimensions
        for i in range(len(clusters)):
            cluster = clusters[i]
            x_min = np.ndarray.min(cluster[:, 0])
            x_max = np.ndarray.max(cluster[:, 0])
            y_min = np.ndarray.min(cluster[:, 1])
            y_max = np.ndarray.max(cluster[:, 1])
            z_min = np.ndarray.min(cluster[:, 2])
            z_max = np.ndarray.max(cluster[:, 2])
            
            # Take the middle point along the x and z axes and project it onto the ground plane to find the position
            ground_plane_cf = self.measurements.get_ground_plane()
            position = MyFancyPedestrianDetector.get_position((x_max + x_min) / 2, (z_max + z_min) / 2, ground_plane_cf)
            
            # Calculate the objects' dimension (notice the use of epsilon as a margin)
            w = x_max - x_min + epsilon
            h = y_max - y_min + 2 * epsilon
            l = z_max - z_min + epsilon
            dimension = np.array([l, w, h]) 
            
            # Calculate the 3D corners for the future bounding box
            corners = []
            for j in (-0.1, 0, 0.1):
                corner_1 = np.array([position[0] - w/2 - epsilon/2 - j, position[1] + epsilon + j, position[2]])
                corner_2 = np.array([position[0] + w/2 + epsilon/2 + j, position[1] + epsilon + j, position[2]])
                corner_3 = np.array([position[0] - w/2 - epsilon/2 - j, position[1] - h - j, position[2]])
                corner_4 = np.array([position[0] + w/2 + epsilon/2 + j, position[1] - h - j, position[2]])
                corner_all = np.array([corner_1, corner_2, corner_3, corner_4])
                corners.append(corner_all)
            
            # Filter only objects that stay within given height range and append position, dimension and corners
            if h > 1.2 and h < 1.9 and w < 1.5 and l < 1.5:
                indexes.append(i)
                positions.append(position)
                dimensions.append(dimension)
                for corner in corners:
                    corners_3d.append(corner)
        # Get only clusters within range
        good_clusters = [clusters[i] for i in indexes]
        
        # Plot selected cluster with position and bounding box 3D corners
        if self.is_debug:
            colors = colors_qualitative_k3d * 5
            plot_2 = k3d.plot()
            plot_2 += plot_axes(np.eye(4, dtype=np.float32))
            for i in range(len(corners_3d)):
                plot_2 += k3d.points(positions=good_clusters[int(i/3)].astype(np.float32), point_size=0.1, color=colors[int(i/3)])
                plot_2 += k3d.points(positions=positions[int(i/3)].astype(np.float32), point_size=0.3, color=0xff0000)
                plot_2 += k3d.points(positions=corners_3d[i].astype(np.float32), point_size=0.2, color=0x00ff00)
#             plot_2.display()
            self.doa.add_k3d_plot(
                **{
                    "name": "Filtered clusters",
                    "description": "A plot showing filtered clusters with their measured position (red dots) and 3D bounding box corners (green dots).",
                    "plot": plot_2,
                }
            )
            
        return positions, dimensions, np.array(corners_3d)
    
    
    # Create bounding boxes and frame proposals based on 3d corners (very similar to what was done in
    # fa_02a_3d_pedestrian_detection_single_camera)
    def get_bbox(self):
        
        # Get data from previous function
        positions, dimensions, corners_3d = MyFancyPedestrianDetector.filter_clusters(self, 0.25)

        # Project the points in 2D
        corners_3d_aug = np.dstack((corners_3d, np.ones((corners_3d.shape[0], 4, 1)))).reshape(-1, 4)
        projection_matrix = self.measurements.get_camera_projection_matrix()
        corners_2d = project_points(projection_matrix, corners_3d_aug)
        # Reshape the array to get the corners of each bounding box per row
        corners_2d = corners_2d.reshape(-1, 4, 2)

        # Create a list with a bounding box for each set of corners
        bbox_list = []
        for i in range(corners_2d.shape[0]):
            bbox = BoundingBox(corners_2d[i, 2, 1].astype(np.int32), corners_2d[i, 2, 0].astype(np.int32), corners_2d[i, 1, 1].astype(np.int32), corners_2d[i, 1, 0].astype(np.int32), from_corners=True)
            bbox_list.append(bbox)

        # Print the image with all the possible bounding boxes
        if self.is_debug:
            image_test = self.measurements.get_camera_image()
            for i in range(len(bbox_list)):
                draw_bbox_to_image(image_test, bbox_list[i])
#             showimage(image_test)
            self.doa.add_image(
                **{
                    "name": "All possible bboxes",
                    "description": "The image shows in the 2D environment all the bounding boxes that will be considered for the detection.\n",
                    "image": image_test,
                }
            )

        # Finally, create the proposals based on the bounding boxes
        image = self.measurements.get_camera_image()
        frame_proposals = []
        for i in range(len(bbox_list)):
            clip_bbox_to_image(bbox_list[i], image.shape[:2])
            patch_image = image[bbox_list[i].v:bbox_list[i].v+bbox_list[i].h, bbox_list[i].u:bbox_list[i].u+bbox_list[i].w]
            patch = ImagePatch(patch_image, bbox_list[i])
            frame_proposals.append(patch)
        
        return positions, dimensions, frame_proposals
    
    
    # Simply apply the pretrained classifier on the proposals and identify the ones with pedestrian.
    # Also this part is almost identical to fa_02a_3d_pedestrian_detection_single_camera
    def find_pedestrian(self):
        
        # First, define the classifier. A pretrained classifier will be used due to the lack of a training set.
        patch_classifier = tf.keras.models.load_model(os.path.join(os.environ["SOURCE_DIR"], "practicum1", "pedestrian_classifier"))
        
        # Get the frame_proposals with get_bbox function
        positions, dimensions, frame_proposals = MyFancyPedestrianDetector.get_bbox(self)
        
        # Preprocess the patches and classify them
        frame_patches = np.concatenate([preprocessing_fn_mobilenet(proposal_patch.image) for proposal_patch in frame_proposals], 0)
        predictions = patch_classifier.predict(frame_patches)
        # Add the score feature of each patch to its description
        for i, pred in enumerate(predictions):
            frame_proposals[i].score = pred
        
        # Filter the patches to extract only those representing a pedestrian with high certainty
        threshold = 0.4
        pedestrian_patches = []
        idx = []
        for i in range(len(frame_proposals)):
            if frame_proposals[i].score >= threshold:
                pedestrian_patches.append(frame_proposals[i])
                idx.append(i)
        positions = [positions[int(i/3)] for i in idx]
        dimensions = [dimensions[int(i/3)] for i in idx]
        
        # Reduce the overlapping proposals to a single one using NMS algorithm.
        # I had to modify the code from practicum 1 since that only generated one patch even for two distinct pedestrians.
        # Notice that in this case it is very uncommon to have overlapping proposals
        all_bboxes = []
        confidences = []
        nms_patches = []
        overlap_thresh = 0.01
        for frame in pedestrian_patches:
            bbox = np.asarray([frame.bbox.get_bbox_corners()])
            bbox = bbox.reshape(4,)
            all_bboxes.append(bbox)
            confidence = np.asarray([frame.score[0]])
            confidences.append(confidence)
        confidences = np.array(confidences)
        n_confidences = confidences.shape[0]
        confidences = confidences.reshape(n_confidences,)
        if len(all_bboxes) > 0:
            idx = non_max_suppression(np.array(all_bboxes), np.array(confidences), max_output_size=len(all_bboxes), iou_threshold=overlap_thresh)
            for i in idx:
                nms_patch = pedestrian_patches[i]
                nms_patches.append(nms_patch)
            positions = [positions[i] for i in idx]
            dimensions = [dimensions[i] for i in idx]

        # Print the final result
        if self.is_debug:
            image_test_2 = self.measurements.get_camera_image()
            for pedestrian in nms_patches:
                bbox = pedestrian.bbox
                draw_bbox_to_image(image_test_2, bbox, color=(255,0,0))
#             showimage(image_test_2)
            self.doa.add_image(
                **{
                    "name": "Pedestrian patches in the image after NMS",
                    "description": "The patch with highest certainty of containing a pedestrian is shown on the image\n",
                    "image": image_test_2,
                }
            )
        
        return positions, dimensions, nms_patches
    
    
    # Create the dictionary with all the acquired information
    def get_pedestrian_dicts(self):
        
        # Get information with previous function
        positions, dimensions, nms_patches = MyFancyPedestrianDetector.find_pedestrian(self)
        
        # Create the homogeneous transformation from the object position
        T_cam_object = []
        positions = np.array(positions)
        for i in range(len(positions)):
            T_cam_object.append(np.array([[0, -1,  0, positions[i, 0]],
                                          [0,  0, -1, positions[i, 1]],
                                          [1,  0,  0, positions[i, 2]],
                                          [0,  0,  0,              1]]))
        
        # Create the final dictionary
        pedestrian_dicts = []
        for i in range(len(nms_patches)):
            p_dict = {'label_class': 'Pedestrian',
                      # The extent [l, w, h] comes from the lidar cluster dimensions computed in filter_clusters
                      # (including the epsilon margins), not from the 2D bounding box.
                      'extent_object': dimensions[i],
                      'T_cam_object': T_cam_object[i],
                      'score': nms_patches[i].score[0]}
            pedestrian_dicts.append(p_dict)
            
        return pedestrian_dicts


pedestrian_detector = MyFancyPedestrianDetector(is_debug=is_debug)
    
# raise NotImplementedError()

dataset = Dataset()
start_index = 1430
end_index = 1545
sequence = dataset.get_custom_sequence(start_index, end_index)

# get first measurements object of the sequence
measurements = next(iter(sequence))
# measurements = sequence[start_index + 77]

# feed measurements
pedestrian_detector.set_measurements(measurements)

pedestrian_dicts = pedestrian_detector.get_pedestrian_dicts()
pedestrian_dicts

In [None]:
# make sure each pedestrian_dict has all required keys present
required_keys = {"label_class", "extent_object", "T_cam_object", "score"}
for pedestrian_dict in pedestrian_dicts:
    assert required_keys.issubset(set(pedestrian_dict.keys()))

In [None]:
# make sure the pedestrian_detector object is a (duck-typed) PedestrianDetector subclass
assert isinstance(pedestrian_dicts, list)
assert {"doa", "get_pedestrian_dicts"}.issubset(set(dir(pedestrian_detector)))

In [None]:
# let's have a look at your debug outputs
# show debug outputs
#
# it is important to us, that you create sufficient intermediate results
# and also use verbose descriptions of the debug outputs
# (as you would use for captions of figures in scientific papers)
#
# you can toggle scrolling of the output by selecting this cell and 'Cell' > 'Current Outputs' > 'Toggle Scrolling'
[None for i in iter(pedestrian_detector.doa)]

### Localize pedestrians on the whole sequence

Please assemble the target structure `frame_pedestrian_dicts` below by iterating over the sequence and obtaining all pedestrian dicts.

In [None]:
from assignment.solution_helpers import DurationAggregator
from tqdm.notebook import tqdm

sequence = dataset.get_custom_sequence(start_index, end_index)
frame_pedestrian_dicts = {
    1430: [
        {
            # ...
        },
    ]  # frame_index as key. Fill me with pedestrian_dicts using your subclass of PedestrianDetector
}

is_debug = False
pedestrian_detector = None  # overwrite me with your instantiated pedestrian detector class

# YOUR CODE HERE

# Instantiate pedestrian detector
pedestrian_detector = MyFancyPedestrianDetector(is_debug=is_debug)

# raise NotImplementedError()

# log time for running detector on each measurements instance
duration_aggregator = DurationAggregator(is_print_durations=True)
for measurements in tqdm(duration_aggregator.aggregate_durations(sequence), total=len(sequence)):

    pedestrian_detector.set_measurements(measurements)
    refined_proposal_dicts_nms = pedestrian_detector.get_pedestrian_dicts()
    frame_pedestrian_dicts[measurements.get_index()] = refined_proposal_dicts_nms

In [None]:
assert len(duration_aggregator) == len(sequence)
mean_duration_s = duration_aggregator.get_mean_duration_s()
print(f"mean duration: {mean_duration_s:.2f} s")

### Q 02b.3 Runtime
Please reflect on the mean duration of your algorithm.
1. What is the mean duration of your algorithm on your machine?
2. How much speed-up would be needed in order to run it 'real-time' within a car given a sensor measurement update rate of 10 Hz?
3. How does the runtime compare against your camera-only detector?

Don't overoptimize: your approach should run at most 30 s per timestep (to keep our inference time during grading manageable), though something around 1-3 s per timestep seems a realistic goal.

### A 02b.3
**Your answer:** (maximum 150 words)
1. The mean duration on my machine is 5.54 s per frame.
2. Once again the application is much slower than required for a real-time implementation: at 10 Hz each frame must be processed within 0.1 s, so a speed-up of about 55.4x (5.54 s / 0.1 s) would be necessary.
3. The runtime is slightly faster than that of the camera-only detector. It is not much faster, despite having far fewer frame proposals to classify, because of the 3D analysis: creating the smart bounding boxes requires additional computations (such as clustering), which unfortunately slow down the whole process.
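
The required speed-up in point 2 follows from the per-frame time budget at the sensor rate. A small worked sketch, using the mean duration reported in point 1 and hypothetical variable names:

```python
mean_duration_s = 5.54   # measured mean duration per frame (from point 1)
update_rate_hz = 10.0    # sensor measurement update rate

frame_budget_s = 1.0 / update_rate_hz            # time available per frame
required_speedup = mean_duration_s / frame_budget_s

print(f"budget per frame: {frame_budget_s:.2f} s")
print(f"required speed-up: {required_speedup:.1f}x")
```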

YOUR ANSWER HERE

In [None]:
# check for proper format
from assignment.solution_helpers import save_frame_pedestrian_dicts

# make sure all frames within the sequence are filled with frame pedestrian dicts
assert set(frame_pedestrian_dicts.keys()) == set(sequence.get_indices())

# check for type of output
for fpds in frame_pedestrian_dicts.values():
    for fpd in fpds:
        assert {"label_class", "extent_object", "T_cam_object"}.issubset(set(fpd.keys()))
        assert fpd["T_cam_object"].shape == (4, 4)
        assert fpd["label_class"] == "Pedestrian"

# use save_frame_pedestrian_dicts with is_dry_run=True to check for serializability
is_serializable = True
try:
    save_frame_pedestrian_dicts(frame_pedestrian_dicts, is_dry_run=True)
except TypeError as e:
    print("Error, frame_pedestrian_dicts is not json serializable: %s" % str(e))
    is_serializable = False
if not is_serializable:
    assert False, "See error above"

## Quantitative Evaluation (Image Projections)
Please evaluate your detector via comparing the projected 2D bounding boxes of the `frame_pedestrian_dicts` you obtained via your approach against ground truth pedestrian bounding boxes (cf. [Practicum 1](../practicum1/practicum1.ipynb)).
Evaluation metrics will be precision-recall curves, average precision (IOU=0.2) and mean average precision (mAP).
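
Matching a predicted box against a ground-truth box at a given IoU threshold relies on intersection over union. The following is a minimal sketch, assuming boxes are given as `(u_min, v_min, u_max, v_max)` in pixels; this corner convention and the function name `iou` are assumptions for illustration, and the actual evaluation uses the helpers from Practicum 1.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (u_min, v_min, u_max, v_max)."""
    # Width and height of the overlap region (zero if the boxes do not overlap)
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


prediction = (100, 100, 200, 300)
ground_truth = (150, 100, 250, 300)
print(iou(prediction, ground_truth))  # overlap 50x200 over union 150x200
```

With these toy boxes the IoU is one third, so the prediction would count as a match at an IoU threshold of 0.2 but not at 0.5.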

### Ground Truth bounding boxes (image projections)
Let us compute `gt_bboxes` with a list of pedestrian bounding box coordinates for each frame by reusing our implementation from 02a.

In [None]:
from ipynb.fs.defs.fa_02a_3d_pedestrian_detection_single_camera import get_gt_bboxes

gt_bboxes = get_gt_bboxes(sequence)

In [None]:
assert len(gt_bboxes) == len(sequence)
assert all(len(bbox) == 4 for bboxes in gt_bboxes for bbox in bboxes)

### Prediction bounding boxes (image projections)
Let's assemble `sequence_proposals` out of your `frame_pedestrian_dicts` as in 02a.

In [None]:
from ipynb.fs.defs.fa_02a_3d_pedestrian_detection_single_camera import get_sequence_proposals

sequence_proposals = get_sequence_proposals(sequence, frame_pedestrian_dicts)

In [None]:
from practicum1.ImagePatch import ImagePatch

assert len(sequence_proposals) == len(gt_bboxes)
# sequence_proposals should be of type ImagePatch and have score of proper shape and range
assert all(isinstance(sp, ImagePatch) for sps in sequence_proposals for sp in sps)
assert all(
    len(sp.score) == 1 for sps in sequence_proposals for sp in sps
), "score as in practicum1 needs to be a one-element list"
assert all(sp.score[0] >= 0.0 for sps in sequence_proposals for sp in sps)
assert all(sp.score[0] <= 1.0 for sps in sequence_proposals for sp in sps)

### Metrics Dict (image projections)
We use `generate_metrics_dict` as in Practicum 1 to evaluate `sequence_proposals` against `gt_bboxes` for the given `discrimination_thresholds` and `iou_thresholds`.

In [None]:
from practicum1.evaluation import generate_metrics_dict

discrimination_thresholds = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
iou_thresholds = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

metrics_dict = generate_metrics_dict(sequence_proposals, gt_bboxes, discrimination_thresholds, iou_thresholds)

metrics_dict

In [None]:
assert set(metrics_dict.keys()) == set(iou_thresholds)
assert all(v.shape == (len(discrimination_thresholds), 2) for v in metrics_dict.values())

### Precision-Recall Curve (image projections)
Let's plot the Precision-Recall curve for the IOU threshold of 0.2 (and interactively).
See Practicum 1.

In [None]:
# insert a precision-recall curve plot here
from ipynb.fs.defs.practicum1 import plot_pr_curve
from ipywidgets import fixed, interact, FloatSlider

interact(plot_pr_curve, metrics_dict=fixed(metrics_dict), iou_thresh=FloatSlider(min=0.0, max=1.0, step=0.1, value=0.2))

### Average Precision (image projections)
What is the `average_precision` for `iou_threshold = 0.2`?
Feel free to copy your code from 02a.
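
Average precision can be approximated as the area under the precision-recall curve. Below is a minimal sketch using the trapezoidal rule in plain Python, assuming one (precision, recall) pair per discrimination threshold. The sample values are invented, and the points are sorted by recall because area integration expects a monotonic x-axis (`sklearn.metrics.auc` likewise requires monotonic x values).

```python
def average_precision(precisions, recalls):
    """Area under the precision-recall curve via the trapezoidal rule."""
    # Sort the operating points by recall so the integration runs left to right
    points = sorted(zip(recalls, precisions))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


# Invented example values, one pair per discrimination threshold
precisions = [0.5, 0.6, 0.8, 1.0]
recalls = [1.0, 0.8, 0.5, 0.0]
print(f"{average_precision(precisions, recalls):.3f}")
```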


In [None]:
iou_threshold = 0.2

from sklearn.metrics import auc

precisions, recalls = metrics_dict[iou_threshold].T
average_precision = auc(recalls, precisions)

print(f"Average Precision @ IoU thresh. of {iou_threshold:.01f} = {average_precision * 100:.01f} (image projections)")

In [None]:
assert average_precision >= 0.0
assert average_precision <= 1.0

### Mean Average Precision (image projections)
What is the `mean_average_precision` (mAP) of your approach?

A basic implementation should achieve an mAP value of at least 5%.

In [None]:
from practicum1.metrics import mAP

mean_average_precision = mAP(metrics_dict)
print(f"Mean Average Precision: {mean_average_precision * 100:.01f} (image projections)")

In [None]:
# DO NOT DELETE THIS CELL!

## Video (qualitative evaluation)
Let's create a video over the whole sequence drawing the projected bounding boxes of all detected 3D pedestrians in `frame_pedestrian_dicts`.
We reuse the function `draw_pedestrian_bounding_boxes` we implemented in `fa_02a_3d_pedestrian_detection_single_camera`.

In [None]:
from ipynb.fs.defs.fa_02a_3d_pedestrian_detection_single_camera import draw_pedestrian_bounding_boxes

images_draw = draw_pedestrian_bounding_boxes(frame_pedestrian_dicts, sequence)

In [None]:
# make sure we have a video along the whole sequence
assert len(images_draw) == len(sequence)
# make sure we have images of full resolution and color
assert images_draw[0].shape == (1216, 1936, 3)

Let's visualize the video inline via `create_animation`. This might take a minute.

In [None]:
from common.visualization import create_animation
from IPython.display import HTML

anim = create_animation(images_draw)
HTML(anim.to_html5_video())

# Birds-eye view visualization

Let's create a birds-eye view plot to judge the distance of the objects to the camera frame.

In [None]:
# extract pedestrian positions in birds-eye view for every frame_index from frame_pedestrian_dicts
import numpy as np
from collections import defaultdict

frame_ped_positions = dict()
frame_ped_scores = dict()
for frame_index, pedestrian_dicts in frame_pedestrian_dicts.items():
    frame_ped_positions[frame_index] = []
    frame_ped_scores[frame_index] = []
    for pedestrian_dict in pedestrian_dicts:
        ped_position = pedestrian_dict["T_cam_object"][[0, 2], 3]  # take only xz positions (in camera frame)
        frame_ped_positions[frame_index].append(ped_position)
        frame_ped_scores[frame_index].append(pedestrian_dict["score"])
for frame_index, ped_positions in frame_ped_positions.items():
    frame_ped_positions[frame_index] = np.asarray(ped_positions).reshape(-1, 2)
for frame_index, ped_scores in frame_ped_scores.items():
    frame_ped_scores[frame_index] = np.asarray(ped_scores).reshape(-1, 1)
frame_ped_positions[1430], frame_ped_scores[1430]  # frame 1430
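
The indexing `T_cam_object[[0, 2], 3]` used above may look cryptic. The small sketch below shows what it selects, assuming `T_cam_object` is a 4x4 homogeneous transform whose last column holds the object position in the camera frame. The numeric values are made up.

```python
import numpy as np

# hypothetical pose: identity rotation, translation (x, y, z) in metres
T_cam_object = np.eye(4)
T_cam_object[:3, 3] = [1.5, -0.2, 12.0]

# rows 0 and 2 of column 3 are the x and z coordinates of the translation
xz = T_cam_object[[0, 2], 3]
ground_distance = float(np.hypot(xz[0], xz[1]))  # distance in the xz plane

# reshape(-1, 2) keeps a consistent (N, 2) shape even for frames without detections
empty_frame = np.asarray([]).reshape(-1, 2)

print(xz, round(ground_distance, 2), empty_frame.shape)
```

Dropping the y coordinate (height in the camera frame) is what turns the 3D position into a birds-eye view position.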

In [None]:
# do the same for ground truth pedestrian positions
frame_ped_gts = dict()
for measurements in sequence:
    frame_index = measurements.get_index()
    frame_ped_gts[frame_index] = []
    # subselect pedestrians
    labels_camera = [m for m in measurements.get_labels_camera() if m["label_class"] == "Pedestrian"]
    for label_camera in labels_camera:
        ped_gt = label_camera["T_cam_object"][[0, 2], 3]  # take only xz positions (in camera frame)
        frame_ped_gts[frame_index].append(ped_gt)
for frame_index, ped_gts in frame_ped_gts.items():
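    # reshape so frames without pedestrians still have shape (0, 2) and can be stacked later
    frame_ped_gts[frame_index] = np.asarray(ped_gts).reshape(-1, 2)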
frame_ped_gts[1430]  # frame 1430

In [None]:
# create interactive plot showing detected pedestrian positions and ground truth positions
import matplotlib.pyplot as plt
from IPython.display import HTML
from matplotlib.animation import FuncAnimation

# get bounds for plotting
all_ped_positions = np.vstack(list(frame_ped_positions.values()))
all_ped_gts = np.vstack(list(frame_ped_gts.values()))
all_peds = np.vstack([all_ped_positions, all_ped_gts])
xmax, zmax = np.max(np.abs(all_peds), axis=0)  # symmetric
xmin, zmin = -xmax, -zmax
zmin = 0.0  # make plot start at camera position

fig, ax = plt.subplots(figsize=(15, 12))

def plot_ped_positions(frame_index):
    ax.cla()  # remove content from last frame
    ax.set_xlim(left=xmin - 2.0, right=xmax + 2.0)
    ax.set_ylim(bottom=zmin, top=zmax + 2.0)
    ax.set_aspect("equal")
    ax.set_xlabel("x (camera frame)")
    ax.set_ylabel("z (camera frame)")
    ax.set_title(f"frame: {frame_index}")
    ax.grid(True, alpha=0.5)
    ax.scatter(0.0, 0.0, color="r")  # camera frame

    if frame_ped_gts[frame_index].size > 0:
        ax.scatter(
            frame_ped_gts[frame_index][:, 0], frame_ped_gts[frame_index][:, 1], color="y", s=500, marker="*", alpha=0.6
        )

    if frame_ped_positions[frame_index].size > 0:
        ax.scatter(frame_ped_positions[frame_index][:, 0], frame_ped_positions[frame_index][:, 1])


ani = FuncAnimation(fig, func=plot_ped_positions, frames=list(frame_ped_positions.keys()))
plt.close()  # avoid drawing additional figure below animation
HTML(ani.to_jshtml())

## Quantitative Evaluation (birds-eye view)

Let's see what average precision (AP) and mean average precision (mAP) we get for the birds-eye view based evaluation, as was done in the notebook pedestrian detection with a _single camera_.

In [None]:
# create sequence ground plane proposals
# ground plane == XZ plane of camera frame
from assignment.evaluation_helpers import get_sequence_proposals_circle

sequence_groundplane_proposals = get_sequence_proposals_circle(frame_ped_positions, frame_ped_scores)

print('The sequence_groundplane_proposals has for every frame a list with a dict for every detection!')
print('The list for the first frame is:\n{}'.format(sequence_groundplane_proposals[0]))

In [None]:
# create ground truth sequence ground plane proposals
# ground plane == XZ plane of camera frame
from assignment.evaluation_helpers import get_GT_sequence_groundplane_proposals

GT_sequence_groundplane_proposals = get_GT_sequence_groundplane_proposals(frame_ped_gts)
    
print('The GT_sequence_groundplane_proposals has for every frame a list with a dict for every pedestrian!')
print('The list for the first frame is:\n{}'.format(GT_sequence_groundplane_proposals[0]))

In [None]:
from assignment.evaluation_helpers import generate_metrics_dict_circle

discrimination_thresholds = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
iou_thresholds = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
radius_m = 3.0  # radius of representing circles for overlap computation

# generate the metrics_dict from birds-eye view based circles
metrics_dict = generate_metrics_dict_circle(sequence_groundplane_proposals,
                                            GT_sequence_groundplane_proposals,
                                            discrimination_thresholds,
                                            iou_thresholds,
                                            radius=radius_m)
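
To give a feeling for what an IoU threshold means in the birds-eye view, the sketch below computes the IoU of two circles with the same radius from the distance between their centers. This is the standard lens-area formula, not necessarily the code inside `generate_metrics_dict_circle`; the function name `circle_iou` is an assumption.

```python
import math


def circle_iou(center_a, center_b, radius):
    """IoU of two circles with equal radius in the ground (xz) plane."""
    d = math.dist(center_a, center_b)
    if d >= 2.0 * radius:
        return 0.0  # circles do not overlap
    # area of the lens-shaped intersection of two equal circles
    inter = (2.0 * radius ** 2 * math.acos(d / (2.0 * radius))
             - 0.5 * d * math.sqrt(4.0 * radius ** 2 - d ** 2))
    union = 2.0 * math.pi * radius ** 2 - inter
    return inter / union


# IoU for a few position errors (in metres) with a 3 m radius
for offset_m in (0.0, 1.0, 3.0, 6.0):
    print(offset_m, round(circle_iou((0.0, 10.0), (offset_m, 10.0), 3.0), 3))
```

With a radius of 3 m, a position error of 3 m still yields an IoU of roughly 0.24 under this formula, so it would count as a match at an IoU threshold of 0.2.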

### Precision-Recall Curve (birds-eye view)
Run the cell below to plot the Precision-Recall curve for the IoU threshold of 0.2 (and interactively).
See Practicum 1 and the notebook pedestrian detection with a _single camera_.

In [None]:
# plot the precision-recall curve
interact(plot_pr_curve, metrics_dict=fixed(metrics_dict), iou_thresh=FloatSlider(min=0.0, max=1.0, step=0.1, value=0.2))

### Average Precision (birds-eye view)
What is the `average_precision` for `iou_threshold = 0.2`?


In [None]:
from sklearn.metrics import auc

iou_threshold = 0.2
precisions, recalls = metrics_dict[iou_threshold].T
average_precision = auc(recalls, precisions)

print(f"Average Precision @ IoU thresh. of {iou_threshold:.01f} = {average_precision * 100:.01f} (birds-eye view)")

### Mean Average Precision (birds-eye view)
What is the `mean_average_precision` (mAP) of your approach?
Let's reuse code from Practicum 1.

A basic implementation should achieve an mAP value of at least 10%.

In [None]:
from practicum1.metrics import mAP

mAP_value = mAP(metrics_dict)

print(f'Mean Average Precision: {mAP_value*100:.01f} (birds-eye view)')

### Q 02b.4 Interpretation of experimental results
Please interpret your experimental results:
1. Qualitative: How does your approach behave in terms of false positives and false negatives? (video / birds-eye view plot)
2. Quantitative: Please discuss the Precision-Recall plot, AP and mAP values in comparison to ideally achievable values. Compare the obtained 3D detection performance and associated runtime (this notebook), with the numbers obtained in the single-camera case (previous notebook).

### A 02b.4
**Your answer:** (maximum 350 words)
1. As in the camera-only case, the qualitative performance is satisfying, and there are some clear improvements with respect to the previous scenario. In the video, there are visibly fewer false positives, even though the classifier threshold is lower than in the previous case (0.4 against 0.6). The reason is that there are far fewer proposals per frame, so the risk of triggering a false positive is much smaller. In the birds-eye view, the clearest difference is that the predictions, which were only roughly close to the actual pedestrians in the previous case, now match their positions almost exactly. The reason is that, instead of creating bounding boxes and then trying to use them to "catch" possible pedestrians, this time the bounding boxes are built on top of possible pedestrians and are therefore accurate in terms of position.
2. Once again the Precision-Recall plots look fairly standard, with precision higher than recall on average. To analyze them, it is again useful to concentrate on AP and mAP. As expected, the values are higher than in the camera-only scenario. For a classifier threshold of 0.4, the AP at an IoU threshold of 0.2 is 34.6 for the image-projection evaluation and 35.3 for the birds-eye view evaluation. The mAP follows a similar pattern, with a value of 25.3 for the image projection and 32.0 for the birds-eye view. Contrary to the previous case, this time the birds-eye view evaluation scores higher than the image-projection evaluation. This is probably due to the point made in 1. about the positions being much more accurate than before. Also in this case, however, the algorithm still fails to identify further pedestrians.


The `frame_pedestrian_dicts` will not be used in successive notebooks.
If you need pedestrian positions in further notebooks, please use the serialized output of the camera-only solution.
Let's still check for data completeness in the next cell.

In [None]:
from assignment.solution_helpers import save_frame_pedestrian_dicts

# check for serializability (despite not writing out)
save_frame_pedestrian_dicts(frame_pedestrian_dicts, is_dry_run=True)

### Q 02b.5 Future Work
1. How could you improve your method even further, e.g., if you had more time at your disposal?

### A 02b.5
**Your answer:** (maximum 150 words)
1. One aspect that can be improved is the estimated dimension of the final object. For now, the dictionary contains an estimate based on the dimension of the LiDAR point cluster, with an arbitrary epsilon added to reach reasonable final values. This could be improved by comparing the known dimensions of objects with the dimensions of their clusters in order to find a more precise epsilon. Also, with a real-time implementation in mind, the code can be greatly optimized and cleaned up to bring its runtime closer to real-time requirements.


# GREAT JOB!
You've come very far. You detected pedestrian locations in 3D around a moving vehicle from noisy sensor data of multiple sensors.
That's a great achievement.