## The Principle: Trigonometry
The idea here is simple: given metric depth values for every pixel's ray (accurate ones, which is a big ask), we can calculate the length of any line segment in the 3D world whose endpoints are both visible, using trigonometry. The depth values for the two endpoint pixels give us the lengths of two sides of a triangle whose third vertex is the camera, and UniDepth's dense camera prediction directly gives us the angle between the two rays (without having to figure out the FOV).

Armed with the lengths of two sides and the angle between them, I'm 99.999% sure we can compute the third side, although I've never been very good at trigonometry and keep forgetting the law of cosines.
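For the record, the law of cosines says that a triangle with sides $a$ and $b$ enclosing an angle $\theta$ has a third side $c$ with $c^2 = a^2 + b^2 - 2ab\cos\theta$. Here is a minimal sketch of that computation. The function name `third_side` is made up for illustration, and it assumes the two inputs are distances measured along the rays (not z-depths) and that the angle is in radians.

```python
import math


def third_side(d1: float, d2: float, angle_rad: float) -> float:
    """Length of the side opposite `angle_rad`, given the two sides enclosing it.

    Assumes d1 and d2 are distances along the two camera rays and
    angle_rad is the angle between those rays, in radians.
    """
    sq = d1 * d1 + d2 * d2 - 2.0 * d1 * d2 * math.cos(angle_rad)
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(sq, 0.0))


# Sanity check: a right angle reduces this to Pythagoras (3-4-5 triangle).
print(third_side(3.0, 4.0, math.pi / 2))
```

With a right angle between the rays this should print a value very close to 5, and with an angle of zero it reduces to the difference between the two distances.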

UniDepth makes this even easier for us: since it predicts the rays themselves using its pseudo-spherical output space, it can directly give us the 3D points corresponding to each pixel (calculated using its predicted camera parameters), and we can just compute the Euclidean distance between them.
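When a model only returns a depth map and a focal length, a rough equivalent can be sketched with a plain pinhole back-projection. This is an illustration under stated assumptions, not UniDepth's own code: it assumes the depth map holds z-depth (distance along the optical axis), that the principal point sits at the image centre, and that pixels are square. The helper names `backproject` and `segment_length` are hypothetical.

```python
import numpy as np


def backproject(u, v, depth, f_px, width, height):
    """Pixel (u, v) with z-depth `depth` -> 3D point in the camera frame."""
    # Assumption: principal point at the image centre, square pixels.
    cx, cy = width / 2.0, height / 2.0
    x = (u - cx) * depth / f_px
    y = (v - cy) * depth / f_px
    return np.array([x, y, depth], dtype=np.float64)


def segment_length(px_a, px_b, depth_map, f_px):
    """Euclidean distance between the 3D points seen at two pixels.

    px_a and px_b are (u, v) = (column, row) pixel coordinates.
    """
    height, width = depth_map.shape
    (ua, va), (ub, vb) = px_a, px_b
    # NumPy images are indexed [row, column], i.e. [v, u].
    a = backproject(ua, va, depth_map[va, ua], f_px, width, height)
    b = backproject(ub, vb, depth_map[vb, ub], f_px, width, height)
    return float(np.linalg.norm(a - b))


# Toy example: a flat wall 2 m away, seen with a 100 px focal length.
wall = np.full((100, 200), 2.0)
print(segment_length((50, 50), (150, 50), wall, f_px=100.0))
```

In the toy example the two pixels are 100 px apart at a focal length of 100 px and a depth of 2 m, so the segment should come out at about 2 m. On real images the result is only as good as the predicted depth and focal length.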

In [1]:
from PIL import Image
import depth_pro

In [2]:
import torch

dev = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
dev

device(type='cuda', index=0)

In [3]:
model, transform = depth_pro.create_model_and_transforms(device=dev)
_ = model.eval()



First, I generate depth maps and camera (focal length) predictions for each image.

In [4]:
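import gc
import os

import matplotlib.pyplot as plt
import numpy as np
from tqdm import tqdm

IMG_DIR = "data"
OUT_DIR = "depthpro_out"
os.makedirs(OUT_DIR, exist_ok=True)  # plt.imsave fails if the directory is missing

depths = {}  # image name -> depth map (metres)
focals = {}  # image name -> predicted focal length (pixels)

for filename in tqdm([s for s in os.listdir(IMG_DIR) if s.endswith(".jpg")]):
    name = os.path.splitext(filename)[0]

    # load_rgb also returns a focal length read from the file's metadata;
    # it is ignored here so the model predicts the focal length itself.
    img, _, _ = depth_pro.load_rgb(os.path.join(IMG_DIR, filename))
    img = transform(img).to(dev)

    preds = model.infer(img)

    depth = preds["depth"].cpu().numpy()

    depths[name] = depth
    focals[name] = preds["focallength_px"].cpu().item()

    # Save a grayscale preview of the depth map
    plt.imsave(os.path.join(OUT_DIR, f"{name}.png"), depth, cmap="gray")

# Release cached GPU memory once every image has been processed
torch.cuda.empty_cache()
gc.collect()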

100%|██████████| 24/24 [09:16<00:00, 23.20s/it]


29

In [10]:
np.savez(f"{OUT_DIR}/depths.npz", **depths)
np.savez(f"{OUT_DIR}/focals.npz", **focals)

In [11]:
focals = np.load(f"{OUT_DIR}/focals.npz")

focals["kartripta1"]

array(2718.80249023)

Before we begin, I want to talk about one neat benefit that Criminisi's method has over this one. Criminisi's method does not actually use the image's pixel values. It is a purely geometric derivation and is therefore not fazed by visual characteristics such as transparency or lighting conditions. It is only concerned with projective invariants, and its only "visual" aspect is the identification of the keypoints.

On the other hand, depth estimation, predictably, is very sensitive to these image characteristics, since it has nothing else to go off of. This results in outputs like these:

<center>
	<img src = "data/kartripta9.jpg" style="width: 30%">
	<img src = "depthpro_out/kartripta9.png" style="width: 30%">
</center>

Clearly, it registers some of the glass wall as an actual solid wall, which means that this method won't work on this image. For Criminisi's method, however, this image is an ideal case: with the image plane at a high inclination angle to the world axis, it performs extremely well.