# Street View Object Detection
In this project, we leverage computer vision techniques to detect and predict accessibility features such as wheelchairs, ramps, etc., in street view videos. The goal is to integrate the output of these detections into our SLAM (Simultaneous Localization and Mapping) model to create a comprehensive and dynamic map of the environment. The following procedures are applied to each video frame:

* Camera Calibration: This is the first step where we calibrate the camera to ensure accurate measurements and object detection. The calibration process helps us understand the camera's intrinsic and extrinsic parameters, which are crucial for mapping the image coordinates to real-world coordinates.

* Object Detection with YOLOv3: We open a video stream as input and run each frame through our YOLOv3 object detection model. The model identifies accessibility features and produces bounding boxes for the detected objects. These bounding boxes are then drawn onto the frames of the video stream, providing real-time object detection.

* Conversion to Geospatial Data: The outputs of the object detection and depth estimation models, which include the class, bounding box, and depth of each detected object, are then converted into geospatial data. This involves mapping the image coordinates of the detected objects to real-world coordinates, taking into account the camera's field of view, orientation, and location.

* Updating the SLAM Map: The geospatial data derived from the object detection and depth estimation models is used to update our SLAM map. This map, which is initially created using the SLAM algorithm, is continuously updated with new information about the location and type of accessibility features in the environment. This results in a dynamic and comprehensive map that can be used for navigation and accessibility planning.
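
The conversion step in the list above can be sketched with a pinhole camera model. This is only an illustrative sketch, not the project's implementation: the function names, the local east-north-up frame, the level-camera assumption, and all numeric values are hypothetical.

```python
import math

def pixel_to_camera(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth along the optical axis (metres)
    into camera coordinates (x right, y down, z forward)."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)

def camera_to_world(point_cam, heading_deg, cam_position):
    """Map a camera-frame point into a local east-north-up frame.

    Assumption: the camera is level and its heading is measured
    clockwise from north.
    """
    x, y, z = point_cam
    h = math.radians(heading_deg)
    east = cam_position[0] + x * math.cos(h) + z * math.sin(h)
    north = cam_position[1] - x * math.sin(h) + z * math.cos(h)
    up = cam_position[2] - y
    return (east, north, up)

# Hypothetical detection: box centre at pixel (740, 360), estimated depth 10 m
point_cam = pixel_to_camera(740, 360, 10.0, fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)
point_world = camera_to_world(point_cam, heading_deg=90.0, cam_position=(5.0, 2.0, 1.5))
print(point_world)
```

A full pipeline would replace the heading-only rotation with the complete camera pose estimated by SLAM and convert the local coordinates into a geographic reference system.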

## Camera Calibration

We first establish a pattern variable that holds the object points in the (x, y, z) coordinate space of the chessboard. Here, x and y are the horizontal and vertical indices of the chessboard's inner corners, while z is always 0 because the board is planar. These object points are the same for every calibration image, since we expect the same chessboard pattern in each one.

Next, we locate the chessboard corners in each calibration image; these are the image points.

Once we've collected the points from each image, we compute the camera calibration matrix and distortion coefficients using the cv.calibrateCamera() function.

In [None]:
import cv2 as cv
import numpy as np

# pattern variable: (x, y, 0) coordinates of the chessboard's inner corners
# assumes pattern_size = (columns, rows) of inner corners and img (a BGR
# calibration image) are already defined
pattern = np.zeros((pattern_size[1] * pattern_size[0], 3), np.float32)
pattern[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2)

# find the chessboard corners, then refine them to sub-pixel accuracy
# (repeat this step for every calibration image)
pattern_points = []
image_points = []
gray = cv.cvtColor(img, cv.COLOR_BGR2GRAY)
criteria = (cv.TERM_CRITERIA_EPS + cv.TERM_CRITERIA_MAX_ITER, 30, 0.001)
found, corners = cv.findChessboardCorners(gray, pattern_size)
if found:
    corners = cv.cornerSubPix(gray, corners, (11, 11), (-1, -1), criteria)
    pattern_points.append(pattern)
    image_points.append(corners)

# compute camera calibration (camera matrix and distortion coefficients)
ret, mtx, dist, rvecs, tvecs = cv.calibrateCamera(pattern_points, image_points, gray.shape[::-1], None, None)

# get the corrected (undistorted) image
h, w = img.shape[:2]
newcameramtx, roi = cv.getOptimalNewCameraMatrix(mtx, dist, (w, h), 1, (w, h))
dst = cv.undistort(img, mtx, dist, None, newcameramtx)
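
To make the roles of the camera matrix and distortion coefficients more concrete, the sketch below projects a 3D point in camera coordinates to a pixel. It is a simplified illustration: it keeps only the first two radial terms (k1, k2) of OpenCV's distortion vector (k1, k2, p1, p2, k3) and ignores the tangential terms. The function name and the numeric values are hypothetical, not calibration results.

```python
def project_point(point_cam, fx, fy, cx, cy, k1=0.0, k2=0.0):
    """Project a camera-frame point (X, Y, Z) to pixel coordinates (u, v).

    fx, fy, cx, cy are the entries of the camera matrix
    [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]; k1 and k2 are radial
    distortion coefficients (tangential terms are omitted here).
    """
    X, Y, Z = point_cam
    x, y = X / Z, Y / Z                      # normalized image coordinates
    r2 = x * x + y * y                       # squared distance from the optical axis
    radial = 1.0 + k1 * r2 + k2 * r2 * r2    # radial distortion factor
    return (fx * x * radial + cx, fy * y * radial + cy)

# Hypothetical intrinsics and a point 10 m ahead, 1 m to the right
ideal = project_point((1.0, 0.0, 10.0), 1000.0, 1000.0, 640.0, 360.0)
distorted = project_point((1.0, 0.0, 10.0), 1000.0, 1000.0, 640.0, 360.0, k1=-0.1)
print(ideal, distorted)
```

With a negative k1 the projected point moves toward the image centre, which is the barrel distortion that cv.undistort() compensates for.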

## Video Processing

In this section, we use the YOLOv3 (You Only Look Once) model to identify objects, including various accessibility features, in the street view video.

The video is processed on a frame-by-frame basis. Each detected object is highlighted with a bounding box and labeled with its class name. This processed video, complete with object detection annotations, is then saved as a new video file for further analysis and review.

This approach allows us to visually identify and locate accessibility features within the video, aiding in the assessment of street accessibility.
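
As a rough illustration of how a bounding box is obtained for each detection, the sketch below decodes one raw YOLO output row. It assumes the usual Darknet layout (normalized centre x, centre y, width, height, objectness, then one score per class); the function name and sample values are hypothetical.

```python
def decode_detection(detection, frame_width, frame_height):
    """Turn one YOLO output row into a pixel box, class id and confidence.

    detection: [cx, cy, w, h, objectness, score_0, score_1, ...],
    with the box values normalized to the range [0, 1].
    """
    scores = detection[5:]
    class_id = max(range(len(scores)), key=lambda i: scores[i])
    confidence = scores[class_id]

    # scale the normalized box to pixels
    w = int(detection[2] * frame_width)
    h = int(detection[3] * frame_height)

    # convert the box centre to the top-left corner expected by drawing calls
    x = int(detection[0] * frame_width - w / 2)
    y = int(detection[1] * frame_height - h / 2)
    return [x, y, w, h], class_id, confidence

# Hypothetical row with three classes on an 800x400 frame
box, class_id, confidence = decode_detection(
    [0.5, 0.5, 0.25, 0.5, 0.9, 0.1, 0.8, 0.05], 800, 400)
print(box, class_id, confidence)
```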

In [None]:
!pip install opencv-python imageai
from google.colab.patches import cv2_imshow
from pycocotools.coco import COCO

import os
import cv2 as cv
from imageai.Detection import ObjectDetection
from collections import Counter
import pandas as pd
import numpy as np
import requests as req

In [None]:
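# load the pretrained YOLOv3 model (trained on the COCO dataset)
yolo_model = cv.dnn.readNetFromDarknet('yolov3.cfg', 'yolov3.weights')

# load class names from the COCO dataset (one name per line);
# stored as `classes`, the name used when drawing labels
with open('coco.names') as f:
    classes = [line.strip() for line in f if line.strip()]
print(classes)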

# get output layers names
output_layers_names = yolo_model.getUnconnectedOutLayersNames()

In [None]:
# load street view video
cap = cv.VideoCapture('street_view.mp4')

# process video frame by frame
frame_id = 0
while True:
    # get the current frame; stop when the video ends or a read fails
    ok, frame = cap.read()
    if not ok:
        break
    frame_id += 1

    # get the height and width of the frame
    height, width, _ = frame.shape

    # convert the frame into a blob
    blob = cv.dnn.blobFromImage(frame, 1/255, (416, 416), (0, 0, 0), swapRB=True, crop=False)

    # set the input of the model
    yolo_model.setInput(blob)
    layer_outputs = yolo_model.forward(output_layers_names)

    # get the bounding boxes, confidences and class ids
    boxes = []
    confidences = []
    class_ids = []
    for output in layer_outputs:
        for detection in output:
            # get the class probabilities
            scores = detection[5:]

            # get the class id
            class_id = np.argmax(scores)

            # get the confidence
            confidence = scores[class_id]

            # filter out weak predictions
            if confidence > 0.5:
                # get the bounding box
                center_x = int(detection[0] * width)
                center_y = int(detection[1] * height)
                w = int(detection[2] * width)
                h = int(detection[3] * height)

                # get the top left corner
                x = int(center_x - w/2)
                y = int(center_y - h/2)

                # update the bounding box, confidences and class ids
                boxes.append([x, y, w, h])
                confidences.append(float(confidence))
                class_ids.append(class_id)

    # apply non-max suppression
    indexes = cv.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)

    # draw the bounding boxes and class labels
    font = cv.FONT_HERSHEY_PLAIN
    colors = np.random.uniform(0, 255, size=(len(boxes), 3))
    # NMSBoxes may return an empty tuple when nothing is detected
    for i in np.array(indexes, dtype=int).flatten():
        # get the bounding box
        x, y, w, h = boxes[i]

        # get the class label
        label = str(classes[class_ids[i]])

        # get the confidence
        confidence = str(round(confidences[i], 2))

        # get the color as a plain list of floats for the drawing call
        color = colors[i].tolist()

        # draw the bounding box
        cv.rectangle(frame, (x, y), (x+w, y+h), color, 2)

        # draw the class label
        cv.putText(frame, label + ' ' + confidence, (x, y+20), font, 2, (255, 255, 255), 2)

    # show the frame
    cv2_imshow(frame)

    # press 'q' to exit
    if cv.waitKey(1) == ord('q'):
        break

# release the video capture object
cap.release()
cv.destroyAllWindows()
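
The non-max suppression step in the loop above can be sketched in plain Python. This is a simplified greedy version written for illustration, not OpenCV's implementation of cv.dnn.NMSBoxes; the function names and sample boxes are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two [x, y, w, h] boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, confidences, score_threshold, nms_threshold):
    """Keep the highest-confidence boxes, dropping any box that overlaps
    an already kept box by more than nms_threshold."""
    # consider only boxes above the score threshold, best first
    order = sorted(
        (i for i, c in enumerate(confidences) if c > score_threshold),
        key=lambda i: confidences[i],
        reverse=True,
    )
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= nms_threshold for j in keep):
            keep.append(i)
    return keep

# Two heavily overlapping boxes and one separate box
boxes = [[100, 100, 50, 50], [105, 105, 50, 50], [300, 300, 40, 40]]
confidences = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, confidences, 0.5, 0.4)
print(kept)
```

The second box overlaps the first with an IoU of about 0.68, which exceeds the 0.4 threshold, so only the first and third boxes are kept.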