
How to calculate the vertical distance from a D455 to the surface represented by each pixel value in a depth image? #11657

Closed
uzgit opened this issue Apr 6, 2023 · 9 comments

Comments

@uzgit

uzgit commented Apr 6, 2023

Required Info
Camera Model D455
Firmware Version 05.14.00.00
Operating System & Version PC, Ubuntu 22.04
Kernel Version (Linux Only) 5.19.0-38-generic
Platform PC
SDK Version 2.53.1
Language Python
Segment Robot

Issue Description

I am working on a project where I need to identify contiguous flat/level regions using a depth image. When the camera is pointed in the direction of the gravity vector, this is easy because the vertical distance is exactly the distance represented by the depth frame. For example, in the image below, I can simply flood fill (with a small tolerance) from the center of the image and easily identify the top of the Turtle Bot as a contiguous, level region (highlighted in green). The red regions are identified as particularly "un-level" because their gradients do not align well enough with the gravity vector.
image

And here it can identify the floor:
image
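As a minimal sketch of that flood-fill step (assuming a depth image already converted to metres and scikit-image available; the seed pixel and tolerance below are illustrative, not the exact values I use):

import numpy
from skimage.segmentation import flood

# depth_m: 2D array of depth values in metres, camera pointing straight down
seed = (depth_m.shape[0] // 2, depth_m.shape[1] // 2)   # centre pixel as (row, column)
level_region = flood(depth_m, seed, tolerance=0.01)     # boolean mask of the contiguous ~level region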

The problem becomes harder when the camera is not pointing straight down, mostly because I don't know how to correctly transform the depth image. The flood fill just floods until it reaches its tolerance and then stops, so I get results like below:
image

The non-zero orientation (in pitch/roll) of the camera is causing an apparent gradient over surfaces that are really level. I cannot assume the orientation of the camera will be straight down, so I need to transform the depth data. I have the gravity vector using a Python implementation of rs-motion's RotationEstimator class. I don't need the yaw because it does not bias the appearance of level surfaces (and anyway it drifts significantly over time).

Problem

How do I transform the depth data correctly? I would like to create a 2D image with the same dimensions as the depth image, where the value at each pixel corresponds to the vertical distance (with respect to the gravity vector) from the camera to the detected surface/object. I have tried using the rs2_deproject_pixel_to_point function to get the (x,y,z) coordinates represented by each pixel, and then rotating them by the inverse of the rotation estimated from the apparent gravity vector detected by the camera, as below:

import numpy
import pyrealsense2 as rs
from scipy.spatial.transform import Rotation

height = numpy.zeros( depth_data.shape )
rotation = Rotation.from_rotvec( gravity_vector )
for u in range(480):        # rows
    for v in range(640):    # columns
        point = rs.rs2_deproject_pixel_to_point( intrinsics, (u,v), depth_data[u,v] )
        point = rotation.inv().apply(point)
        height[u,v] = point[2]

However, I get similar results to above, and if I visualize the height matrix, the gradient is still visible:
image

Any insight on how to calculate the vertical distance from the D455 to the surface at each pixel would be greatly appreciated!

@uzgit uzgit changed the title How to calculate the vertical distance from a D455 to the surface represented by each pixel value? How to calculate the vertical distance from a D455 to the surface represented by each pixel value in a depth image? Apr 6, 2023
@MartyG-RealSense
Collaborator

Hi @uzgit If the camera is measuring the distance to a coordinate at an angle, rather than being straight ahead and level with the coordinate, the discussion at #9297 may be a useful place to start.

@uzgit
Author

uzgit commented Apr 7, 2023

Hi @MartyG-RealSense, thank you for the recommendation. I have looked at the post and have implemented some things to test.

From what I understand from these posts, I can treat the depth image such that each pixel has (in the v dimension) an equal share of the camera's vertical field of view. Is this correct? In that case, the change in angle per pixel is constant, and I can simply add the pixel's pitch offset to the pitch of the camera to get the pitch of the pixel. (I am assuming that the camera's orientation axis passes through the center of the image, or perhaps through the principal point.)

image

Here, the pitch angle of a particular pixel should be
$$\theta_{p,v} = \theta_p + \left( u_v - c_v \right)\cdot\theta_\mathrm{\Delta p}$$

  • $\theta_{p,v}$ is the pitch angle of the pixel being processed,
  • $\theta_p$ is the pitch angle of the camera (oriented so that $\theta_p=0$
    means that the camera is pointing in the direction of gravity, $\theta_p=\frac{\pi}{2}$ means that the
    camera is upright and facing straight forward, and $\theta_p=-\frac{\pi}{2}$ means that the camera is
    upside down and facing backwards),
  • $c_v$ is the center pixel position of the image in the $v$ dimension,
  • $u_v$ is the pixel position of the pixel being processed (here it is positive because $v=0$ is at the top and $v=\mathrm{height}$ is at the bottom), and
  • $\theta_\mathrm{\Delta p} = \theta_\mathrm{vfov} / \mathrm{height}$ is the per-pixel change in pitch.

Then, I should be able to use simple trigonometry to calculate $h$ in the following diagram, by adding the per-pixel offset to $\theta_p$:

image

with

$$ h = d \cos \left( \theta_{p,v} \right) = d \cos \left( \theta_p + \left( u_v - c_v \right)\cdot\theta_\mathrm{\Delta p} \right)$$


I've implemented the following function to test this but I get very similar results to before:

import cv2
import numpy
from skimage.segmentation import flood

def find_contiguous_regions( depth_data, gradient_mask, camera_euler_ryp, intrinsics, gravity_vector ):

    # from rotation estimator
    camera_roll  = camera_euler_ryp[0]
    camera_pitch = camera_euler_ryp[2]

    # from https://www.intelrealsense.com/depth-camera-d455/, converted to radians
    hfov = 87 * numpy.pi / 180.0
    vfov = 58 * numpy.pi / 180.0

    # array of offsets from camera pitch at each v-index, assuming equal arclength per pixel
    # then tile the array into 2D, same dimensions as the depth data
    dpitch = numpy.linspace( - vfov/2, vfov/2, intrinsics.height )
    dpitch_dv = numpy.tile(dpitch, (intrinsics.width, 1)).T

    # add camera pitch to get absolute pitch of each pixel
    pitch = dpitch_dv + camera_pitch

    # multiply the cosine of the pitch angle at each pixel by the depth 
    height = numpy.cos( pitch ) * depth_data

    # these areas have already been eliminated by their gradient
    height[ gradient_mask ] = 0

    # flood fill with tolerance, starting from middle of image: (u,v) are reversed in this function
    # then convert to boolean mask
    regions = flood(height, (240, 320), tolerance=0.1)
    regions = regions.astype(bool)

    # visualize
    output = height
    output = process_for_visualization(output)
    cv2.imshow("output2", output)

    return regions, ""

d455_contiguous_faulty

It is unable to identify the whole floor, or even the whole top of the Turtle Bot because there is still a false gradient on the image which has not been removed. This gradient is visible in the height matrix as well:
image

Hoping someone has some ideas on how to proceed.

@uzgit
Author

uzgit commented Apr 7, 2023

I have just found a small error in my code, and fixing it seems to have improved things considerably:
dpitch = numpy.linspace( -vfov/2, vfov/2, intrinsics.height )
should have been
dpitch = numpy.linspace( vfov/2, -vfov/2, intrinsics.height )

Now I get the correct behavior:
d455_contiguous_working

This is still sensitive to roll because I have only handled pitch so far. You can see below the effect that the roll has on it:
d455_contiguous_working_except_roll

Once I figure out the roll I'll post it here.

@MartyG-RealSense
Collaborator

It's great to hear that you have made significant progress!

The depth pixel value is a measurement from the parallel plane of the imagers and not the absolute range.

image
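As a minimal sketch of that distinction (assuming a pyrealsense2 intrinsics object and a depth value already converted to metres), the absolute range behind a pixel can be recovered by deprojecting the pixel and taking the norm of the resulting point:

import numpy
import pyrealsense2 as rs

def range_at_pixel(intrinsics, depth_m, x, y):
    # depth_m is the distance from the imager plane at pixel (x, y) = (column, row)
    point = rs.rs2_deproject_pixel_to_point(intrinsics, [x, y], depth_m)
    # the deprojected (x, y, z) point is in metres in the camera frame,
    # so its norm is the absolute range from the camera origin
    return numpy.linalg.norm(point)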

A RealSense team member offers advice about obtaining roll and pitch from the IMU with Python at #4391 (comment)

@uzgit
Author

uzgit commented Apr 7, 2023

For the roll and pitch, I am using the following class:

import numpy
from datetime import datetime

# complementary filter: integrate the gyro for short-term orientation and
# blend in the accelerometer's gravity direction to correct long-term drift
class RotationEstimator:

    def __init__(self, alpha=0.98, accel_alpha=0.5):

        self.last_gyro_time = None
        self.theta = numpy.array([0, 0, 0], dtype=numpy.float64)
        self.alpha = alpha

        self.accel_vector = numpy.array([0, 0, 0], dtype=numpy.float64)
        self.alpha_accel = accel_alpha

    def process_gyro_data(self, frame):

        now = datetime.now()
        if self.last_gyro_time is None:
            self.last_gyro_time = now
            return

        dt = (now - self.last_gyro_time).total_seconds()

        frame_type = frame.get_profile().stream_type()
        data = frame.as_motion_frame().get_motion_data()

        d_theta = numpy.array([-data.z, -data.y, data.x])
        self.theta += dt * d_theta

        self.last_gyro_time = now

    def process_accel_data(self, frame):

        frame_type = frame.get_profile().stream_type()
        data = frame.as_motion_frame().get_motion_data()

        accel_vector = numpy.array([data.x, -data.y, -data.z], dtype=numpy.float64)
        accel_vector = accel_vector / numpy.linalg.norm(accel_vector)

        self.accel_vector = self.accel_vector * self.alpha_accel + accel_vector * (1 - self.alpha_accel)

        accel_theta    = numpy.array([0, 0, 0], dtype=numpy.float64)
        accel_theta[0] = numpy.arctan2(data.x, numpy.sqrt(data.y ** 2 + data.z ** 2))

        # sign flips in pitch calculation change the output interval
        # it is normally oriented so 0 pitch is straight up, and we want the opposite
        # -y / -z makes 0 pitch to be straight down,
        # and pitch becomes more positive as the camera pitches up
        accel_theta[2] = numpy.arctan2(-data.y, -data.z)

        self.accel_theta = accel_theta

        self.theta[0] = self.theta[0] * self.alpha + accel_theta[0] * (1 - self.alpha)
        self.theta[2] = self.theta[2] * self.alpha + accel_theta[2] * (1 - self.alpha)

    def get_theta_ryp(self):

        result = self.theta.copy()
        return result

    def get_gravity_vector_accel(self):

        return self.accel_vector

And the gyro and accel streams interact with this class in their own threads, since the gyro and accelerometer data come at different rates:

def handle_gyro():
    while True:
        frames = gyro_pipeline.wait_for_frames()
        rotation_estimator.process_gyro_data(frames[0])

def handle_accel():
    while True:
        frames = accel_pipeline.wait_for_frames()
        rotation_estimator.process_accel_data(frames[0])

This seems to be working pretty well for me so far, but maybe there's a better way to do it.
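For completeness, here is a minimal sketch of the motion-stream setup that feeds those handlers (the enable_stream calls and the thread wiring are a simplified assumption, not my exact script):

import threading
import pyrealsense2 as rs

rotation_estimator = RotationEstimator()

# separate pipelines so the gyro and accelerometer streams can be polled
# independently, since they arrive at different rates
gyro_pipeline = rs.pipeline()
gyro_config = rs.config()
gyro_config.enable_stream(rs.stream.gyro)
gyro_pipeline.start(gyro_config)

accel_pipeline = rs.pipeline()
accel_config = rs.config()
accel_config.enable_stream(rs.stream.accel)
accel_pipeline.start(accel_config)

threading.Thread(target=handle_gyro, daemon=True).start()
threading.Thread(target=handle_accel, daemon=True).start()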

@MartyG-RealSense
Collaborator

I do not have a better solution to suggest than the one that you have implemented. Thanks so much for sharing your solution for the benefit of the RealSense community!

@uzgit
Author

uzgit commented Apr 9, 2023

I have made a bit more progress on this. Lots of credit and thanks to @l0e42! :)

If we augment the solution above (which handles only the pitch), we can pretty easily handle the roll as well. The following are just notes and are meant as a sketch of the idea, not as something rigorous, and it may not be perfect. The image below shows the effect of rotating the camera by an angle $\theta_r$ on a pixel $(u,v)$.

In the previous solution, I considered only $h_1 = v - c_v$. If $\theta_\mathrm{v,p} = \frac{\theta_\mathrm{vfov}}{v_\mathrm{max}}$ is the change in pitch per vertical pixel offset from the center $(c_u, c_v)$, then the pitch for calculating the height of an object at pixel $(u,v)$ is $\theta_{p,(u,v)} = \theta_p + h_1\theta_\mathrm{v,p}$.

If we consider roll, we should use $h_2$, which incorporates the roll to produce a kind of "apparent pitch." We still include $h_1$ through the angle $\omega=\arctan\left( \frac{v-c_v}{u-c_u} \right)$, and we calculate $h_2 = \sin( \omega + \theta_r ) \sqrt{ (u-c_u)^2 + (v-c_v)^2}$. Then the height of the object is $h=d \cos(\theta_p + h_2 \theta_\mathrm{v,p})$, where $d$ is the depth value at pixel $(u,v)$ from the D455.

image

All of this leads to the following solution:

def find_contiguous_regions( depth_data, gradient_mask, camera_euler_ryp, intrinsics, gravity_vector ):

    # from rotation estimator
    camera_roll  = camera_euler_ryp[0]
    camera_pitch = camera_euler_ryp[2]

    # from https://www.intelrealsense.com/depth-camera-d455/, converted to radians
    hfov = 87 * numpy.pi / 180.0
    vfov = 58 * numpy.pi / 180.0
    theta_v = vfov / intrinsics.height

    # this is the actual method
    u = numpy.linspace( 0, intrinsics.width  - 1, intrinsics.width )
    u = numpy.tile(u, (intrinsics.height, 1))
    u -= intrinsics.width/2

    v = numpy.linspace( 0, intrinsics.height - 1, intrinsics.height)
    v = numpy.tile(v, (intrinsics.width,  1)).T
    v -= intrinsics.height/2
    v *= -1

    r = numpy.sqrt( numpy.power(u, 2) + numpy.power(v, 2) )
    arctan_vu = numpy.arctan2(v,u)

    pitch = v * theta_v
    pitch += camera_pitch

    h2 = r * numpy.sin( camera_roll + arctan_vu )
    height = depth_data * numpy.cos( camera_pitch + h2*theta_v )

    # these areas have already been eliminated by their gradient
    height[ gradient_mask ] = 0

    # flood fill with tolerance, starting from middle of image: (u,v) are reversed in this function
    # then convert to boolean mask
    regions = flood(height, (240, 320), tolerance=0.3)
    regions = regions.astype(bool)

    # visualize
    output = height
    output = process_for_visualization(output)
    cv2.imshow("output2", output)

    return regions, ""

And this solution gives the following output:

d455_pitch_roll

It is a bit noisy and sensitive to quick changes in the orientation of the camera, but it is able to correctly identify most of the floor while avoiding other non-contiguous surfaces. The performance will likely improve if the camera is stabilized (instead of handheld, as it is here), and also if the orientation is filtered in a better way than this exponential decay method; I also had to make $\alpha$ fairly low to get it to respond quickly enough. A temporal filter on the pixel classifications could also help (see the sketch below).
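As a sketch of that temporal-filter idea (hypothetical, not something I have implemented here): exponentially average the boolean region mask over frames and threshold it, which should suppress single-frame flicker:

import numpy

smoothed = None

def temporally_filter_regions(regions, beta=0.8, threshold=0.5):
    # regions: boolean mask from find_contiguous_regions for the current frame
    global smoothed
    current = regions.astype(numpy.float64)
    if smoothed is None:
        smoothed = current
    else:
        smoothed = beta * smoothed + (1 - beta) * current
    return smoothed > threshold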

@MartyG-RealSense
Collaborator

Thanks again for sharing your progress in such deep detail :)

@uzgit
Author

uzgit commented Apr 11, 2023

After some post-processing and essentially using a different kind of flood-fill technique, the results are smoother (see below). Ultimately, it seems that the data from the D455 is less reliable when it is viewing the ground at extreme angles, which is understandable. There is probably also some loss of precision as the data is transformed, especially at these extreme angles. I'll close this issue now. Thanks for the suggestions @MartyG-RealSense, and if anyone has any other ideas or suggestions, please let me know!

d455_v207_smoothed

@uzgit uzgit closed this as completed Apr 11, 2023