
Self-supervised depth estimation consists of training a neural network to produce dense, pixel-wise depth predictions from a single image. "Self-supervision" means that the training process does not rely on ground-truth depth maps, which can be hard and expensive to acquire. This project is motivated by Digging Into Self-Supervised Monocular Depth Estimation [1].

The network attempts to generate a synthetic view of the same scene from a different point of view. This implicitly generates a disparity map, provided per-pixel correspondences are known. The relative pose of the virtual viewpoint with respect to the actual viewpoint (i.e. the real camera) is represented as $T_{t \rightarrow t'}$, and the synthetic image is denoted $I_{t'}$, given an original image $I_t$.

The photometric reprojection loss used in training is:
$$L_p = \sum_{t'} pe( I_{t}, I_{t'\rightarrow t})$$

where $pe$ is the photometric reconstruction error, a weighted combination of an SSIM term and the $L1$ distance between matching pixels, defined as:

$$pe(I_a, I_b) = \frac{\alpha}{2}(1-SSIM(I_a, I_b)) + (1-\alpha) ||I_a - I_b||_1$$

$\alpha$ is set to 0.85.
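
As a quick arithmetic check of this weighting, the sketch below evaluates $pe$ for scalar inputs. It assumes the SSIM score and the mean absolute difference have already been computed elsewhere; the function name `photometric_error` and its arguments are illustrative and are not taken from the monodepth2 code.

```python
def photometric_error(ssim_value, l1_value, alpha=0.85):
    """Weighted sum of an SSIM dissimilarity term and an L1 term."""
    ssim_term = (alpha / 2.0) * (1.0 - ssim_value)
    l1_term = (1.0 - alpha) * l1_value
    return ssim_term + l1_term


# Identical images: SSIM is 1 and the L1 distance is 0, so the error vanishes.
identical = photometric_error(ssim_value=1.0, l1_value=0.0)

# Hypothetical mismatch: SSIM of 0.5 and a mean absolute difference of 0.2.
different = photometric_error(ssim_value=0.5, l1_value=0.2)

print(identical, different)
```

For the second case the SSIM term contributes 0.425 × 0.5 and the L1 term 0.15 × 0.2, so the result is about 0.2425, most of it coming from the SSIM term.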

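$SSIM$ measures the structural similarity between two images $x$ and $y$. It compares their local means ($\mu_x$, $\mu_y$), variances ($\sigma_x^2$, $\sigma_y^2$) and covariance ($\sigma_{xy}$), with the small constants $c_1$ and $c_2$ stabilizing the division: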

$$SSIM(x, y) = \frac{(2\mu_x\mu_y+c_1)(2\sigma_{xy}+c_2)}{(\mu_x^2+\mu_y^2+c_1)(\sigma_x^2+\sigma_y^2+c_2)}$$
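
The formula can be evaluated by hand on a single window. The sketch below does this in plain Python for two flattened 3x3 patches, using the same constants as the `SSIM` module later in this notebook ($c_1 = 0.01^2$, $c_2 = 0.03^2$). The helper `ssim_window` and the patch values are made up for illustration; the notebook itself computes these statistics with average pooling over whole images.

```python
def ssim_window(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM of two equally sized lists of pixel intensities in [0, 1]."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    # Population variances and covariance over the window
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


patch = [0.1, 0.4, 0.5, 0.9, 0.3, 0.2, 0.7, 0.8, 0.6]
inverted = [1.0 - p for p in patch]

same = ssim_window(patch, patch)         # a patch compared with itself
opposite = ssim_window(patch, inverted)  # contrast-inverted patch
print(same, opposite)
```

A patch compared with itself scores 1, while the inverted patch has negative covariance with the original and therefore a negative score.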

$I_{t'\rightarrow t}$ is defined as follows:

$$I_{t'\rightarrow t} = I_{t'}\langle proj(D_t, T_{t\rightarrow t'}, K)\rangle$$

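where $D_t$ is the predicted depth map, $K$ is the camera intrinsics matrix, $proj()$ returns the 2D coordinates of the projected depths in $I_{t'}$, and $\langle . \rangle$ is the sampling operator.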
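
To make $proj(D_t, T_{t\rightarrow t'}, K)$ concrete, here is a minimal sketch for a single pixel. It assumes a pinhole camera and a pure translation between the two views (as in a rectified stereo pair); the intrinsics, depth, and baseline values are hypothetical, and `project_pixel` is an illustrative helper, not the `BackprojectDepth`/`Project3D` layers used later.

```python
def project_pixel(u, v, depth, fx, fy, cx, cy, translation):
    """Back-project pixel (u, v) at `depth`, translate it, and re-project it."""
    # Back-projection: pixel plus depth to a 3D point in the camera frame
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    z = depth
    # Rigid transform; the rotation is assumed to be the identity here
    tx, ty, tz = translation
    x, y, z = x + tx, y + ty, z + tz
    # Perspective projection back to pixel coordinates in the other view
    return fx * x / z + cx, fy * y / z + cy


u_src, v_src = project_pixel(u=50.0, v=20.0, depth=10.0,
                             fx=100.0, fy=100.0, cx=64.0, cy=32.0,
                             translation=(-0.5, 0.0, 0.0))
shift = u_src - 50.0  # horizontal displacement of the pixel
print(u_src, v_src, shift)
```

With a purely horizontal translation the row stays the same and the column moves by `fx * tx / depth`, which is the usual relation between disparity and depth: nearer points shift more.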

Note that the $pe$ function provides a weighted sum of visually perceptible differences between images (SSIM), and the L1 norm between said images, as described in [2].




In [None]:
! nvcc --version

! pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 -f https://download.pytorch.org/whl/torch_stable.html
! pip install tensorboardX==1.4
! pip install opencv-python==3.3.1.11
! pip install kornia

# Clone repo
! git clone https://github.com/ArjunNarayan2066/monodepth2.git

In [None]:
%matplotlib inline

import os
import numpy as np
import PIL.Image as pil
import matplotlib.pyplot as plt
from PIL import Image
import kornia

import torch
from torchvision import transforms

import networks
from utils import download_model_if_doesnt_exist

In [None]:
import torch.nn as nn


# Returns the SSIM dissimilarity (1 - SSIM) / 2, clamped to [0, 1]
class SSIM(nn.Module):
    def __init__(self):
        super(SSIM, self).__init__()
        self.mu_x_pool   = nn.AvgPool2d(3, 1)
        self.mu_y_pool   = nn.AvgPool2d(3, 1)
        self.sig_x_pool  = nn.AvgPool2d(3, 1)
        self.sig_y_pool  = nn.AvgPool2d(3, 1)
        self.sig_xy_pool = nn.AvgPool2d(3, 1)

        self.refl = nn.ReflectionPad2d(1)

        self.C1 = 0.01 ** 2
        self.C2 = 0.03 ** 2

    def forward(self, x, y):
        x = self.refl(x)
        y = self.refl(y)

        mu_x = self.mu_x_pool(x)
        mu_y = self.mu_y_pool(y)

        sigma_x  = self.sig_x_pool(x ** 2) - mu_x ** 2
        sigma_y  = self.sig_y_pool(y ** 2) - mu_y ** 2
        sigma_xy = self.sig_xy_pool(x * y) - mu_x * mu_y

        SSIM_n = (2 * mu_x * mu_y + self.C1) * (2 * sigma_xy + self.C2)
        SSIM_d = (mu_x ** 2 + mu_y ** 2 + self.C1) * (sigma_x + sigma_y + self.C2)

        return torch.clamp((1 - SSIM_n / SSIM_d) / 2, 0, 1)


In [None]:
! wget -i kitti_archives_to_download.txt -P kitti_data/

! unzip -o "kitti_data/*.zip" -d "kitti_data/"

# Convert to jpg
! sudo apt update
! sudo apt install imagemagick
! sudo apt install parallel
! find kitti_data/ -name '*.png' | parallel 'convert -quality 92 -sampling-factor 2x2,1x1,1x1 {.}.png {.}.jpg && rm {}'


Generate a list of all the downloaded files, then tag each entry with $l$ or $r$ according to which camera of the stereo pair captured it.


In [None]:
! find ./kitti_data -type f -name '*.jpg' > all_files.txt

In [None]:
# Paths of the images found on disk, one per line
with open('./all_files.txt', 'r') as fp:
    all_lines = [x.strip() for x in fp]

# Eigen split entries, each of the form "<folder> <frame number> <l or r>"
with open('./eigen_all_files.txt', 'r') as fp2:
    eigen_lines = [x.strip() for x in fp2]

# Map "<folder> <frame number>" to the camera side
eigen_dict = {}

for line in eigen_lines:
    filepath, number, RL = line.split(' ')
    number = str(int(number))  # remove leading zeros

    key = ' '.join([filepath, number])

    eigen_dict[key] = RL

# Keep only the downloaded frames that appear in the Eigen split
with open("output_files_train.txt", "w") as output_file:
    for path in all_lines:
        parts = path.split('/')
        first_part = '/'.join(parts[2:-3])
        second_part = str(int(parts[-1].split('.')[0]))  # remove leading zeros

        joined = ' '.join([first_part, second_part])

        if joined in eigen_dict:
            output_file.write(' '.join([joined, eigen_dict[joined]]))
            output_file.write('\n')
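
The path handling in the cell above is easier to follow on a single example. The sketch below repeats the same slicing on one path; `to_split_key` is an illustrative helper and the path is a made-up example in the layout used by this notebook (`./kitti_data/<date>/<drive>/<camera>/data/<frame>.jpg`).

```python
def to_split_key(path):
    """Turn a frame path into '<date>/<drive> <frame number>'."""
    parts = path.split('/')
    # Drop the leading '.' and 'kitti_data', and the trailing camera/data/file parts
    folder = '/'.join(parts[2:-3])
    # The file name is a zero-padded frame number; int() drops the padding
    frame_number = str(int(parts[-1].split('.')[0]))
    return ' '.join([folder, frame_number])


example_path = "./kitti_data/2011_09_26/2011_09_26_drive_0106_sync/image_02/data/0000000115.jpg"
key = to_split_key(example_path)
print(key)
```

For this path the key is `2011_09_26/2011_09_26_drive_0106_sync 115`, which is the form looked up in `eigen_dict` before the `l` or `r` tag is appended.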


In [None]:
import torch
! uname -a
print(torch.cuda.is_available())
# Only query the device name when a GPU is actually present
if torch.cuda.is_available():
    print(torch.cuda.get_device_name())

In [None]:
# ! python monodepth2/train.py --model_name stereo_model  --frame_ids 0 --use_stereo --split eigen_full

In [None]:
# ! find kitti_data/** -name '*.png' | parallel 'convert -quality 92 -sampling-factor 2x2,1x1,1x1 {.}.png {.}.jpg && rm {}'

In [None]:
# ! wget https://s3.eu-central-1.amazonaws.com/avg-kitti/raw_data/2011_09_28_calib.zip
# ! wget https://s3.eu-central-1.amazonaws.com/avg-kitti/raw_data/2011_09_28_drive_0001/2011_09_28_drive_0001_sync.zip
# ! wget https://s3.eu-central-1.amazonaws.com/avg-kitti/raw_data/2011_09_28_drive_0002/2011_09_28_drive_0002_sync.zip



In [None]:
# ! rm -rf kitti_data
# ! mkdir kitti_data
# ! unzip -q 2011_09_28_drive_0001_sync.zip -d kitti_data
# ! rm -rf 2011_09_28_drive_0001_sync.zip
# ! mv data_temp/2011_09_26/2011_09_26_drive_0095_sync/* kitti_data

In [None]:
# ! sudo apt update
# ! sudo apt install imagemagick --fix-missing
# ! convert -h
# ! find kitti_data/ -name '*.png'
# ! sudo apt install parallel
# convert -quality 92 -sampling-factor 2x2,1x1,1x1 kitti_data/2011_09_26/2011_09_26_drive_0048_sync/image_02/data/0000000005.png jpg && rm {}
# ! find kitti_data/2011_09_28 -name '*.png' | parallel 'convert -quality 92 -sampling-factor 2x2,1x1,1x1 {.}.png {.}.jpg && rm {}'

In [None]:
! python monodepth2/train.py --model_name S_640x192 --frame_ids 0 --use_stereo --pose_model_type separate_resnet --split eigen_full --data_path /content/kitti_data --num_epochs 10
# /content/kitti_data/2011_09_26/2011_09_26_drive_0106_sync/image_02/data/0000000115.png

In [None]:

import numpy as np
import torch
import torch.nn as nn

from collections import OrderedDict
import layers
from layers import *


class DepthDecoder(nn.Module):
    def __init__(self, num_ch_enc, num_channels_out=1):
        super(DepthDecoder, self).__init__()

        self.num_channels_out = num_channels_out
        self.scales = [0,1,2,3]

        self.num_ch_enc = num_ch_enc
        self.num_ch_dec = np.array([16, 32, 64, 128, 256])

        # decoder   
        self.convs = OrderedDict()

        ### 4 ###
        # upconv_0
        num_ch_in = self.num_ch_enc[-1]
        num_ch_out = self.num_ch_dec[4]
        self.convs[("upconv", 4, 0)] = layers.ConvBlock(num_ch_in, num_ch_out)

        # upconv_1
        num_ch_in = self.num_ch_dec[4]
        #skip connection
        num_ch_in += self.num_ch_enc[3]
        num_ch_out = self.num_ch_dec[4]
        self.convs[("upconv", 4, 1)] = layers.ConvBlock(num_ch_in, num_ch_out)

        ### 3 ###
        # upconv_0
        num_ch_in = self.num_ch_dec[4]
        num_ch_out = self.num_ch_dec[3]
        self.convs[("upconv", 3, 0)] = layers.ConvBlock(num_ch_in, num_ch_out)

        # upconv_1
        num_ch_in = self.num_ch_dec[3]
        #skip connection
        num_ch_in += self.num_ch_enc[2]
        num_ch_out = self.num_ch_dec[3]
        self.convs[("upconv", 3, 1)] = layers.ConvBlock(num_ch_in, num_ch_out)

        ### 2 ###
        # upconv_0
        num_ch_in = self.num_ch_dec[3]
        num_ch_out = self.num_ch_dec[2]
        self.convs[("upconv", 2, 0)] = layers.ConvBlock(num_ch_in, num_ch_out)

        # upconv_1
        num_ch_in = self.num_ch_dec[2]
        #skip connection
        num_ch_in += self.num_ch_enc[1]
        num_ch_out = self.num_ch_dec[2]
        self.convs[("upconv", 2, 1)] = layers.ConvBlock(num_ch_in, num_ch_out)

        ### 1 ###
        # upconv_0
        num_ch_in = self.num_ch_dec[2]
        num_ch_out = self.num_ch_dec[1]
        self.convs[("upconv", 1, 0)] = layers.ConvBlock(num_ch_in, num_ch_out)

        # upconv_1
        num_ch_in = self.num_ch_dec[1]
        #skip connection
        num_ch_in += self.num_ch_enc[0]
        num_ch_out = self.num_ch_dec[1]
        self.convs[("upconv", 1, 1)] = layers.ConvBlock(num_ch_in, num_ch_out)

        ### 0 ###
        # upconv_0
        num_ch_in = self.num_ch_dec[1]
        num_ch_out = self.num_ch_dec[0]
        self.convs[("upconv", 0, 0)] = layers.ConvBlock(num_ch_in, num_ch_out)

        # upconv_1
        num_ch_in = self.num_ch_dec[0]
        # No skip connection on the last layer
        num_ch_out = self.num_ch_dec[0]
        self.convs[("upconv", 0, 1)] = layers.ConvBlock(num_ch_in, num_ch_out)



        for s in self.scales:
            self.convs[("dispconv", s)] = layers.Conv3x3(self.num_ch_dec[s], self.num_channels_out)

        self.decoder = nn.ModuleList(list(self.convs.values()))
        self.sigmoid = nn.Sigmoid()

    def forward(self, input_features):
        self.outputs = {}

        x = input_features[-1]

        ### 4 ###
        x = self.convs[("upconv", 4, 0)](x)
        x = [upsample(x)]
        x += [input_features[3]]
        x = torch.cat(x, 1)
        x = self.convs[("upconv", 4, 1)](x)

        ### 3 ###
        x = self.convs[("upconv", 3, 0)](x)
        x = [upsample(x)]
        x += [input_features[2]]
        x = torch.cat(x, 1)
        x = self.convs[("upconv", 3, 1)](x)
        self.outputs[("disp", 3)] = self.sigmoid(self.convs[("dispconv", 3)](x))

        ### 2 ###
        x = self.convs[("upconv", 2, 0)](x)
        x = [upsample(x)]
        x += [input_features[1]]
        x = torch.cat(x, 1)
        x = self.convs[("upconv", 2, 1)](x)
        self.outputs[("disp", 2)] = self.sigmoid(self.convs[("dispconv", 2)](x))

        ### 1 ###
        x = self.convs[("upconv", 1, 0)](x)
        x = [upsample(x)]
        x += [input_features[0]]
        x = torch.cat(x, 1)
        x = self.convs[("upconv", 1, 1)](x)
        self.outputs[("disp", 1)] = self.sigmoid(self.convs[("dispconv", 1)](x))

        ### 0 ###
        x = self.convs[("upconv", 0, 0)](x)
        x = [upsample(x)]
        x = torch.cat(x, 1)
        x = self.convs[("upconv", 0, 1)](x)
        self.outputs[("disp", 0)] = self.sigmoid(self.convs[("dispconv", 0)](x))

        return self.outputs
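
The unrolled constructor above follows a regular pattern, which the sketch below reproduces as a loop that only tracks channel counts. It assumes the encoder widths `[64, 64, 128, 256, 512]` that a ResNet-18 encoder in monodepth2 reports as `num_ch_enc`; `decoder_channels` is an illustrative helper and builds no layers.

```python
num_ch_enc = [64, 64, 128, 256, 512]  # assumed ResNet-18 encoder widths
num_ch_dec = [16, 32, 64, 128, 256]   # decoder widths from DepthDecoder


def decoder_channels(num_ch_enc, num_ch_dec):
    """Map ("upconv", level, step) to its (in_channels, out_channels)."""
    channels = {}
    for i in range(4, -1, -1):
        # Step 0 consumes the previous level's output (or the deepest encoder feature)
        in_0 = num_ch_enc[-1] if i == 4 else num_ch_dec[i + 1]
        channels[("upconv", i, 0)] = (in_0, num_ch_dec[i])
        # Step 1 concatenates the skip connection, except at level 0
        in_1 = num_ch_dec[i] + (num_ch_enc[i - 1] if i > 0 else 0)
        channels[("upconv", i, 1)] = (in_1, num_ch_dec[i])
    return channels


channels = decoder_channels(num_ch_enc, num_ch_dec)
for key, (c_in, c_out) in channels.items():
    print(key, c_in, "->", c_out)
```

Writing the block out by hand, as the class does, keeps each level explicit; the loop form is shorter but hides which encoder feature feeds which level.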

## Compute the image gradient to give the loss function different weights

In [None]:
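import torch.nn.functional as F

def get_gradient(t, gauss_blur_sigma=None, kernel_size = 5):
  # t is expected to be a single-channel batch of shape (N, 1, H, W)
  # Sobel kernel responding to horizontal intensity changes
  a = torch.Tensor([[1, 0, -1],
  [2, 0, -2],
  [1, 0, -1]])

  a = a.view((1,1,3,3))
  G_x = F.conv2d(t, a)

  # Sobel kernel responding to vertical intensity changes
  b = torch.Tensor([[1, 2, 1],
  [0, 0, 0],
  [-1, -2, -1]])

  b = b.view((1,1,3,3))
  G_y = F.conv2d(t, b)

  # Gradient magnitude
  G = torch.sqrt(torch.pow(G_x,2)+ torch.pow(G_y,2))

  # Optionally smooth the magnitude map
  if gauss_blur_sigma:
    G = kornia.filters.gaussian_blur2d(G, (kernel_size, kernel_size), (gauss_blur_sigma, gauss_blur_sigma))

  # Rescale to the range [0, 1]
  G -= torch.min(G)
  G /= torch.max(G)

  return G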
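
For intuition about what the two kernels respond to, the following sketch applies them to a single 3x3 patch in plain Python. It mirrors only the gradient-magnitude step of `get_gradient` (no blur or rescaling); the helper names and patches are illustrative.

```python
import math

SOBEL_X = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
SOBEL_Y = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]


def correlate_3x3(patch, kernel):
    """Element-wise product and sum, which is what conv2d computes per pixel."""
    return sum(patch[r][c] * kernel[r][c] for r in range(3) for c in range(3))


def gradient_magnitude(patch):
    g_x = correlate_3x3(patch, SOBEL_X)
    g_y = correlate_3x3(patch, SOBEL_Y)
    return math.sqrt(g_x ** 2 + g_y ** 2)


vertical_edge = [[0, 0, 1], [0, 0, 1], [0, 0, 1]]  # dark on the left, bright on the right
flat = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]           # no intensity change

edge_response = gradient_magnitude(vertical_edge)
flat_response = gradient_magnitude(flat)
print(edge_response, flat_response)
```

The vertical edge only excites the first kernel and gives a magnitude of 4, while the flat patch gives 0, so the resulting map is large exactly where the image has strong edges.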

### Testing the gradient function:

In [None]:

# Read in image
im = Image.open("./kitti_data/2011_09_28/2011_09_28_drive_0002_sync/image_00/data/0000000172.png")

# Convert to tensor and compute gradient
t = transforms.ToTensor()
t_im = t(im)[None]
t_im.shape
grad = get_gradient(t_im, gauss_blur_sigma=2)


t2 = transforms.ToPILImage()
g = t2(grad[0])

g


In [None]:
import os
import torchvision.models as models

def save_model(train):
    """Save model weights to disk
    """
    save_folder = os.path.join("/content/test_model/")
    if not os.path.exists(save_folder):
        os.makedirs(save_folder)

    for model_name, model in train.models.items():
        save_path = os.path.join(save_folder, "{}.pth".format(model_name))
        to_save = model.state_dict()
        if 'encoder' in model_name:
            # save the sizes - these are needed at prediction time
            to_save['height'] = train.h
            to_save['width'] = train.w
            to_save['use_stereo'] = True
        torch.save(to_save, save_path)

    save_path = os.path.join(save_folder, "{}.pth".format("adam"))
    torch.save(train.adam_optim.state_dict(), save_path)

In [None]:

###
###
### Training code is a heavily stripped-down version of the approach in https://arxiv.org/pdf/1806.01260.pdf
### Written for stereo-based training only, with a reduced feature set
###
###


import monodepth2.layers, monodepth2.utils, monodepth2.trainer
import monodepth2.networks
import monodepth2.datasets  # KITTIRAWDataset is referenced below

import torch
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
import timeit
import PIL.Image as pil


class MyTraining(object):
    def __init__(self, batch, epoch):
        self.batch = batch
        self.epoch_max = epoch
        self.h = 192
        self.w = 640
        if torch.cuda.is_available():
            self.device = torch.device("cuda")
        else:
            self.device = torch.device("cpu")

        # Frames used for each training sample
        # For simple stereo these are only the current frame (0) and its stereo pair ('s')
        self.frames = [0, 's']

        # For multi-scale estimation
        self.loss_scales = [0, 1, 2, 3]

        # Weighting for loss function
        self.alpha = 0.85

        # Define back-projection and projection layers for view synthesis, one pair per scale
        self.depth_projection = {}
        self.projection_3d = {}
        for loss_scale in self.loss_scales:
            h = int(self.h / (2 ** loss_scale))
            w = int(self.w / (2 ** loss_scale))

            self.depth_projection[loss_scale] = monodepth2.layers.BackprojectDepth(self.batch, h, w).to(self.device)
            self.projection_3d[loss_scale] = monodepth2.layers.Project3D(self.batch, h, w).to(self.device)
        
        ## Declare Depth Network
        self.depth_encoder_network = monodepth2.networks.ResnetEncoder(18, True).to(self.device)
        self.depth_decoder_network = monodepth2.networks.DepthDecoder(self.depth_encoder_network.num_ch_enc, 
                                                [0, 1, 2, 3]).to(self.device) 

        ## Declare Pose Network
        self.pose_encoder_network = monodepth2.networks.ResnetEncoder(18, True, num_input_images=2).to(self.device)
        self.pose_decoder_network = monodepth2.networks.PoseDecoder(self.pose_encoder_network.num_ch_enc, 
                                                num_input_features=1, num_frames_to_predict_for=2).to(self.device)

        self.models = {"encoder": self.depth_encoder_network, "depth": self.depth_decoder_network,
                       "pose_encoder": self.pose_encoder_network, "pose": self.pose_decoder_network}

        # Set up our excerpt of Kitti Dataset
        training_files = monodepth2.utils.readlines("/content/monodepth2/splits/eigen_full/train_files.txt")
        validation_files = monodepth2.utils.readlines("/content/monodepth2/splits/eigen_full/val_files.txt")

        train_set = monodepth2.datasets.kitti_dataset.KITTIRAWDataset("/content/kitti_data", training_files, self.h, self.w, self.frames, 4, is_train=True, img_ext='.jpg')
        self.train_loader = DataLoader(train_set, self.batch, True, num_workers=6, pin_memory=True, drop_last=True)
        val_set = monodepth2.datasets.kitti_dataset.KITTIRAWDataset("/content/kitti_data", validation_files, self.h, self.w, self.frames, 4, is_train=False, img_ext='.jpg')
        self.val_loader = DataLoader(val_set, self.batch, True, num_workers=6, pin_memory=True, drop_last=True)

    def run_training_loop(self):
        self.all_params = list(self.depth_encoder_network.parameters())
        self.all_params += list(self.depth_decoder_network.parameters())
        self.all_params += list(self.pose_encoder_network.parameters())
        self.all_params += list(self.pose_decoder_network.parameters())

        # Same learning rate configuration as https://arxiv.org/pdf/1806.01260.pdf
        self.adam_optim = optim.Adam(self.all_params, 1e-4)
        self.lr_sched = optim.lr_scheduler.StepLR(self.adam_optim, 15, 0.1)
        self.ssim_loss_func = monodepth2.layers.SSIM().to(self.device)

        self.epoch_count = 0
        self.losses  = []

        for self.epoch_count in range(self.epoch_max):
            # Per Epoch
            self.depth_encoder_network.train()
            self.depth_decoder_network.train()
            self.pose_encoder_network.train()
            self.pose_decoder_network.train()

            for idx, inputs in enumerate(self.train_loader):
                # Move the batch to the training device
                for key, ipt in inputs.items():
                    inputs[key] = ipt.to(self.device)

                # Per Batch Code
                batch_start_time = timeit.default_timer()

                feature_identifications = self.depth_encoder_network(inputs["color_aug", 0, 0])
                outputs = self.depth_decoder_network(feature_identifications)

                self.estimate_stereo_predictions(inputs, outputs)
                loss = self.batch_loss_func(inputs, outputs)

                self.adam_optim.zero_grad()
                loss["loss"].backward()
                self.losses.append(loss["loss"].item())
                self.adam_optim.step()

                batch_duration = timeit.default_timer() - batch_start_time

                # Do something with batch duration and losses

            # Step the scheduler once per epoch, after the optimizer updates
            self.lr_sched.step()

            print("Finished Epoch: {} with Loss {}".format(self.epoch_count, self.losses[-1]))

    def estimate_stereo_predictions(self, inputs, outputs):
        # Synthesize the stereo view from the predicted depth and the known stereo extrinsics
        for loss_scale in self.loss_scales:
            estimated_disparity = outputs[("disp", loss_scale)]
            disp = F.interpolate(estimated_disparity, [self.h, self.w], mode="bilinear", align_corners=False)
            base = 0 # base scale

            # Convert sigmoid disparity to depth estimate
            scaled_disp = 0.001 + (10 - 0.001) * disp
            depth = 1 / scaled_disp
            outputs[("depth", 0, loss_scale)] = depth

            # Finalize Stereo Estimates
            # Fetch camera extrinsics generated by KITTIRAWDataset
            extrinsics = inputs["stereo_T"]

            camera_coords = self.depth_projection[base](depth, inputs[("inv_K", base)])
            pixel_coords = self.projection_3d[base](camera_coords, inputs[("K", base)], extrinsics)

            outputs[("sample", 's', loss_scale)] = pixel_coords
            outputs[("color", 's', loss_scale)] = F.grid_sample(
                inputs[("color", 's', base)], outputs[("sample", 's', loss_scale)], padding_mode="border")
                
    # Compute SSIM and L1 Loss
    def reprojection(self, prediction, target):
        ssim_loss = self.ssim_loss_func(prediction, target).mean(1, True)
        reproj_losses = self.alpha*ssim_loss + (1-self.alpha)*(torch.abs(target-prediction).mean(1, True))
        return reproj_losses

    def batch_loss_func(self, inputs, outputs):
        batch_loss = {}
        complete_losses = 0

        for loss_scale in self.loss_scales:
            reproj_losses = []

            reproj_losses.append(self.reprojection(outputs[("color", 's', loss_scale)], inputs[("color", 0, 0)]))
            reproj_losses = torch.cat(reproj_losses, 1)

            identity_loss = []
            identity_loss.append(self.reprojection(inputs[("color", 's', 0)], inputs[("color", 0, 0)]))
            identity_loss = torch.cat(identity_loss, 1)


            # Add some minor noise to ensure no repeated values
            identity_loss += torch.rand(identity_loss.shape, device=self.device) * 0.00001
            total = torch.cat((identity_loss, reproj_losses), dim=1)
            val, idxs = torch.min(total, dim=1)
            outputs["identity_selection/{}".format(loss_scale)] = (idxs > identity_loss.shape[1] - 1).float()

            current_loss = val.mean()

            mean_disp = outputs[("disp", loss_scale)].mean(2, True).mean(3, True)
            norm_disp = outputs[("disp", loss_scale)] / (mean_disp + 1e-7)
            smooth_loss = monodepth2.layers.get_smooth_loss(norm_disp, inputs[("color", 0, loss_scale)])

            current_loss += 1e-3 * smooth_loss / (2 ** loss_scale)
            complete_losses += current_loss
            batch_loss["loss/{}".format(loss_scale)] = current_loss

        complete_losses /= len(self.loss_scales)
        batch_loss["loss"] = complete_losses
        return batch_loss
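
The per-pixel minimum in `batch_loss_func` can be illustrated without tensors. The sketch below uses plain lists in place of the two loss maps; `select_min_loss` and the numbers are hypothetical, and the tiny random noise added to the identity loss above is left out.

```python
def select_min_loss(identity_loss, reproj_loss):
    """Return the per-pixel minimum loss and a 0/1 flag marking where reprojection won."""
    values, reproj_selected = [], []
    for ident, reproj in zip(identity_loss, reproj_loss):
        if reproj < ident:
            values.append(reproj)
            reproj_selected.append(1.0)
        else:
            values.append(ident)
            reproj_selected.append(0.0)
    return values, reproj_selected


identity = [0.30, 0.05, 0.20]     # loss of the unwarped stereo image vs. the target
reprojected = [0.10, 0.40, 0.15]  # loss of the warped stereo image vs. the target

values, mask = select_min_loss(identity, reprojected)
scale_loss = sum(values) / len(values)  # mean over pixels, as val.mean() does
print(values, mask, scale_loss)
```

Pixels where the unwarped image already matches better than the warped one (the middle pixel here) contribute the identity loss instead, and the 0/1 flags correspond to what is stored under `identity_selection/{scale}`.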

In [None]:
train = MyTraining(24, 10)
train.run_training_loop()
save_model(train)
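
It is worth checking what the `StepLR` settings in `run_training_loop` (step size 15, decay factor 0.1, base rate 1e-4) imply for a 10-epoch run. The sketch below evaluates the closed-form step-decay rule in plain Python; `step_lr` is an illustrative helper, not a PyTorch function.

```python
def step_lr(base_lr, epoch, step_size=15, gamma=0.1):
    """Step decay: multiply the base rate by gamma once every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


# Learning rate for each of the first 20 epochs
schedule = [step_lr(1e-4, epoch) for epoch in range(20)]

for epoch in (0, 9, 14, 15, 19):
    print(epoch, schedule[epoch])
```

The first decay only happens at epoch 15, so a 10-epoch run trains at the base rate of 1e-4 throughout.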

In [None]:
# Simple test on car image
! python monodepth2/test_simple.py --image_path monodepth2/assets/test_image.jpg --model_name mono+stereo_640x192 --model_path /content/test_model/

In [None]:
import matplotlib.pyplot as plt
import numpy as np

my_losses = np.copy(train.losses)

plt.plot(range(len(my_losses)), my_losses, 'b-')
plt.show()

In [None]:
! python monodepth2/export_gt_depth.py --data_path kitti_data --split eigen

In [None]:
! python monodepth2/evaluate_depth.py --data_path kitti_data --load_weights_folder /content/test_model/ --eval_stereo

# References

[1] https://arxiv.org/pdf/1806.01260.pdf

[2] https://arxiv.org/pdf/1511.08861.pdf