## Poking around the World State

To train a robotic hand to interact with the physical world, we need a reliable 3D reconstruction of the scene. I therefore focused on the following three attributes; for each, I note why I consider it vital.

1. **Hand Pose Keypoints**: These are essential for reconstructing hand geometry and articulation. In other words, they help us "build a hand" in our reconstruction.
2. **Depth Estimation**: Beyond hand pose keypoints, we also need to localize the hand and the surrounding environment in 3D space, which is where depth estimation becomes important.
3. **Object Detection**: Due to time constraints, I did not explore this area in detail. However, it should help us identify objects and potentially attach semantic or physical properties to them, enabling reasoning about hand–object interactions and future extensions.


### Hand Pose Keypoints

We use MediaPipe for hand pose tracking and implement a `HandposeMarker` class that annotates hand keypoints and skeletal connections on both images and videos.

In [2]:
# demo
from scripts.HandposeMarker import HandposeMarker
from scripts.HandposeMarker import demo as handpose_demo

def main():
    handpose_demo()

if __name__ == "__main__":
    main()

SCRIPT_DIR: d:\NYU_Files\2026_Spring\summer_research\NYU UGSRP\EMBODY_INTELL_EXPORT\scripts
IMG_PATH: d:\NYU_Files\2026_Spring\summer_research\NYU UGSRP\EMBODY_INTELL_EXPORT\scripts\..\src\imgs\hand.jpg
VIDEO_PATH: d:\NYU_Files\2026_Spring\summer_research\NYU UGSRP\EMBODY_INTELL_EXPORT\scripts\..\src\videos\1.mp4
Annotated image saved


Processing frames: 100%|██████████| 171/171 [00:13<00:00, 12.98it/s]

Handpose video saved.








<div style="display: flex; gap: 10px;">
    <img src="src/imgs/hand.jpg" width="50%">
    <img src="src/imgs/hand_handpose.jpg" width="50%">
</div>
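
MediaPipe Hands reports 21 landmarks per detected hand, with `x` and `y` normalized by the image width and height. Before keypoints can be drawn on a frame they have to be mapped to pixel coordinates. The following is a minimal sketch of that step, assuming the landmarks are available as plain `(x, y)` tuples. `to_pixel_coords` is a hypothetical helper written for illustration and is not taken from `HandposeMarker` or from MediaPipe.

```python
from typing import List, Tuple


def to_pixel_coords(
    landmarks: List[Tuple[float, float]], width: int, height: int
) -> List[Tuple[int, int]]:
    """Map normalized (x, y) landmarks to integer pixel coordinates.

    Hypothetical helper for illustration; not part of HandposeMarker.
    """
    pixels = []
    for x, y in landmarks:
        # Clamp so that points predicted slightly outside the frame stay drawable.
        px = min(max(int(x * width), 0), width - 1)
        py = min(max(int(y * height), 0), height - 1)
        pixels.append((px, py))
    return pixels


# Synthetic landmarks for illustration only.
example = [(0.5, 0.75), (0.25, 0.25)]
print(to_pixel_coords(example, width=640, height=480))
```

Clamping is a design choice here: it keeps drawing calls safe, but it also hides the fact that a landmark fell outside the frame, so a real pipeline may prefer to flag such points instead.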



A major limitation is that hand pose estimation becomes unreliable under occlusion: self-occlusion or occlusion by an object often yields inaccurate keypoints or causes the hand to be missed entirely.
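
One way to put a number on this limitation for a video is to record, per frame, whether any hand was detected, and then report the detection rate and the longest run of missed frames. The sketch below assumes such a list of booleans was collected during processing. `detection_stats` is a hypothetical helper, and the flags are synthetic rather than measurements from the demo video.

```python
from typing import Dict, List


def detection_stats(detected: List[bool]) -> Dict[str, float]:
    """Summarize per-frame detection flags (True = at least one hand found)."""
    total = len(detected)
    longest_gap = 0
    current_gap = 0
    for flag in detected:
        if flag:
            current_gap = 0
        else:
            current_gap += 1
            longest_gap = max(longest_gap, current_gap)
    rate = sum(detected) / total if total else 0.0
    return {"frames": total, "detection_rate": rate, "longest_gap": longest_gap}


# Synthetic flags for illustration only.
flags = [True, True, False, False, False, True, False, True]
print(detection_stats(flags))
```

If the tracker is built on MediaPipe's legacy solutions API, a frame could be counted as missed when `results.multi_hand_landmarks` is `None`; whether `HandposeMarker` exposes this is an assumption to check against its code.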

### Depth Estimation (Monocular)

We investigate both relative depth estimation with Depth Anything and absolute (metric) depth estimation with UniDepth, but UniDepth poses practical deployment challenges.
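
Relative depth predictions are generally meaningful only up to an unknown scale and shift, which is why they cannot place the hand in metric units on their own. As a rough illustration of the gap between the two kinds of output, the sketch below fits a scale and shift by least squares, assuming a handful of pixels with known reference depth are available and that the relation is affine in whichever space the model predicts. This is a hypothetical setup, not something done in this project, and `fit_scale_shift` is an illustrative name.

```python
from typing import Sequence, Tuple


def fit_scale_shift(
    relative: Sequence[float], reference: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares fit of (scale, shift) so that scale * relative + shift
    approximates the reference values."""
    n = len(relative)
    if n < 2 or n != len(reference):
        raise ValueError("need at least two paired samples")
    mean_r = sum(relative) / n
    mean_m = sum(reference) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    if var_r == 0:
        raise ValueError("relative values must not all be equal")
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, reference))
    scale = cov / var_r
    shift = mean_m - scale * mean_r
    return scale, shift


# Synthetic pairs for illustration only: reference = 2 * relative + 0.5.
scale, shift = fit_scale_shift([1.0, 2.0, 3.0, 4.0], [2.5, 4.5, 6.5, 8.5])
print(scale, shift)
```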

#### Depth Anything

In [1]:
from scripts.DepthAnythingDemo import main as depth_anything_demo
depth_anything_demo()

Loading weights:   0%|          | 0/287 [00:00<?, ?it/s]

Image saved to d:\NYU_Files\2026_Spring\summer_research\NYU UGSRP\EMBODY_INTELL_EXPORT\scripts\..\src\imgs\cat.png
Depth map saved to d:\NYU_Files\2026_Spring\summer_research\NYU UGSRP\EMBODY_INTELL_EXPORT\scripts\..\src\imgs\cat_relative_depth.png


<div style="display: flex; gap: 10px;">
    <img src="src/imgs/cat.png" width="50%">
    <img src="src/imgs/cat_relative_depth.png" width="50%">
</div>

The model produces reasonable relative depth maps that capture coarse scene geometry, but it still struggles with fine-grained detail, and its performance depends on both the lighting conditions and the size of the model used.

To illustrate this effect, we present a simple, non-rigorous experiment showing how additional illumination influences relative depth predictions.

<div style="display: flex; gap: 10px;">
    <img src="src/imgs/img_a.jpg" width="50%">
    <img src="src/imgs/depth_a.png" width="50%">
</div>
<div style="display: flex; gap: 10px;">
    <img src="src/imgs/img_b.jpg" width="50%">
    <img src="src/imgs/depth_b.png" width="50%">
</div>
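
To make a comparison like the one above less subjective, the two depth maps can be min-max normalized and compared pixel by pixel. The sketch below is one simple way to do this in plain Python, assuming both photos were taken from the same viewpoint so that pixels line up, and that the depth maps are same-sized 2D lists. `normalize` and `mean_abs_diff` are hypothetical helpers, and the inputs are synthetic.

```python
from typing import List

DepthMap = List[List[float]]


def normalize(depth: DepthMap) -> DepthMap:
    """Min-max normalize a depth map to [0, 1]; a constant map becomes all zeros."""
    values = [v for row in depth for v in row]
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in row] for row in depth]
    return [[(v - lo) / span for v in row] for row in depth]


def mean_abs_diff(a: DepthMap, b: DepthMap) -> float:
    """Mean absolute difference between two same-sized maps after normalization."""
    na, nb = normalize(a), normalize(b)
    diffs = [abs(x - y) for ra, rb in zip(na, nb) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


# Synthetic maps for illustration only.
dim_light = [[0.0, 1.0], [2.0, 4.0]]
bright_light = [[0.0, 2.0], [4.0, 8.0]]  # same structure, different scale
print(mean_abs_diff(dim_light, bright_light))
```

Normalizing first means a pure change of scale between the two predictions does not count as a difference, which suits relative depth; only changes in the predicted structure contribute to the score.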

#### UniDepth

Although UniDepthV2 is well suited to metric depth estimation, practical deployment is hindered by dependency and toolchain issues in our current Windows-based environment.

When I first tried UniDepthV2, I overlooked its implicit assumptions about dependencies. The project is designed primarily for Linux and relies on several CUDA extensions and third-party high-performance operator libraries. To satisfy these dependencies on Windows 11, I had to work out a compatible combination of Visual Studio, the CUDA Toolkit, and related library toolchains, which added significant environment complexity and triggered some system-level conflicts. For example, it broke my MonoGame development environment, which took some effort to repair.

Furthermore, inspired by my work on object detection, I replaced NumPy with CuPy to reduce CPU-GPU data transfer latency, but this further increased debugging and maintenance costs on this platform. After weighing these costs, I abandoned this approach and reverted to a more stable and controllable implementation. In short, the challenges stemmed not from the model itself but from the mismatch between the software ecosystem and rapidly evolving hardware architectures.
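
A small preflight check can surface missing optional dependencies before a long setup or a slow fallback path is hit. The sketch below uses only the standard library to report which modules are importable, without importing them. The module list is an assumption chosen for illustration and should be adjusted to the actual project, and `check_environment` is a hypothetical helper.

```python
import importlib.util
import platform
from typing import Dict, Iterable


def check_environment(modules: Iterable[str]) -> Dict[str, bool]:
    """Report which top-level modules can be found, without importing them."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    print("Platform:", platform.system(), platform.release())
    # Assumed module names for illustration; adjust to the project's needs.
    wanted = ["torch", "xformers", "cupy", "unidepth"]
    for name, found in check_environment(wanted).items():
        print(f"{name:10s} {'found' if found else 'MISSING'}")
```

This only confirms that a package is installed; it does not verify that compiled CUDA extensions load correctly, which is where most of the trouble described above came from.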

In [11]:
# Make sure UniDepth is installed in your environment first.
from scripts.UniDepthMarker import demo as unidepth_demo
unidepth_demo()


xFormers not available
xFormers not available


Not loading pretrained weights for backbone
EdgeGuidedLocalSSI reverts to a non cuda-optimized operation, you will experince large slowdown, please install it:  `cd ./unidepth/ops/extract_patches && bash compile.sh`




Saved demo to d:\NYU_Files\2026_Spring\summer_research\NYU UGSRP\EMBODY_INTELL_EXPORT\scripts\..\src\imgs/hand_unidepth.png


<div style="display: flex; gap: 10px;">
    <img src="src/imgs/hand.jpg" width="50%">
    <img src="src/imgs/hand_unidepth.png" width="50%">
</div>

Because of the environment issues, part of the time went to setup and debugging. At this stage, we can successfully visualize depth results. In the near future, we hope to combine this module with the hand pose marker implemented earlier to record hand keypoints moving through 3D space over time. Our goal is to reproduce a dataset similar to the provided HDF5 files.
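
Combining the two modules amounts to lifting each 2D keypoint into 3D using the metric depth at that pixel. A minimal sketch with the pinhole camera model follows, assuming the camera intrinsics `fx`, `fy`, `cx`, `cy` are known and the depth map is in meters. `backproject` is a hypothetical helper and the values are synthetic; this is not the project's implementation.

```python
from typing import List, Tuple


def backproject(
    keypoints_px: List[Tuple[int, int]],
    depth_m: List[List[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Tuple[float, float, float]]:
    """Lift pixel keypoints (u, v) to camera-frame 3D points with the pinhole model."""
    points = []
    for u, v in keypoints_px:
        z = depth_m[v][u]  # depth map is indexed [row][column], i.e. [v][u]
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        points.append((x, y, z))
    return points


# Synthetic 4x4 depth map (2 m everywhere) and made-up intrinsics.
depth = [[2.0] * 4 for _ in range(4)]
print(backproject([(1, 1), (3, 2)], depth, fx=2.0, fy=2.0, cx=1.0, cy=1.0))
```

Running this per frame and storing the resulting points with a timestamp would give the kind of keypoint trajectory described above. Reading the depth at a single pixel is fragile near hand edges, so sampling a small neighborhood may be worth considering.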

### Object Detection

The object detection module is based on a YOLO model, but its current performance is unsatisfactory in both quality and efficiency. The exact causes have not yet been investigated, though the issue may be related to model scale. At this stage, detection quality may be insufficient to reliably support downstream training for embodied intelligence tasks.

In [None]:
from scripts.ObjDetector import main as obj_detector_demo
obj_detector_demo()

<div style="display: flex; gap: 10px;">
    <img src="src/imgs/hand.jpg" width="50%">
    <img src="src/imgs/hand_obj_detection.png" width="50%">
</div>
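
To investigate the efficiency side, a first step could be to time inference per frame. The sketch below wraps any per-frame callable and reports mean latency and frames per second. `measure_latency` is a hypothetical helper, and the stand-in workload is synthetic; it does not call the detector in `scripts.ObjDetector`.

```python
import time
from typing import Callable, Dict, Iterable


def measure_latency(
    infer: Callable[[object], object], frames: Iterable[object]
) -> Dict[str, float]:
    """Time a per-frame inference callable and report mean latency and FPS."""
    durations = []
    for frame in frames:
        start = time.perf_counter()
        infer(frame)
        durations.append(time.perf_counter() - start)
    if not durations:
        return {"frames": 0, "mean_ms": 0.0, "fps": 0.0}
    mean_s = sum(durations) / len(durations)
    return {
        "frames": len(durations),
        "mean_ms": mean_s * 1000.0,
        "fps": 1.0 / mean_s if mean_s > 0 else float("inf"),
    }


# Stand-in workload for illustration; replace with the real detector call.
print(measure_latency(lambda frame: sum(range(10_000)), range(20)))
```

For GPU inference, wall-clock timing like this can be misleading because CUDA kernels run asynchronously, so the device should be synchronized before each timestamp is read. Discarding the first few frames as warm-up is also a common precaution.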