# L7 Computer vision group project
## L7 CV Group - 2
### Ben Snow
### Nick Lindfield
### Ashavidya Kusuma

# Depth prediction using video game data for real-world applications.

# Definition of the problem
With the rise of autonomous vehicles, on-camera image processing and augmented reality (AR), there is an increasing need for accurate depth prediction
Current methods produce highly noisy results and lack detail, frequently failing to separate background and foreground objects. [14]


![image cannot be found](depth_comparison.png "Depth comaprison")
Figure 1. A comaprison of predicted and ground truth depth predictions from various neural network depth prediction algorithms.

Current uses for depth prediction/estimation are:
- Autonomous driving - predicting the distance of objects on the road, such as other motorists, pedestrians and cyclists.
- Image processing - blurring foreground subject from the background.
- AR - object occlusion, placing digital characters behind objects such as tables.


With this project, we aim to overcome the difficulties of collecting real-world data by using augmented data from video games, we will extract the depth buffer and RGB values to be used as our training data.


Some advantages and disadvantages of real-world and augmented data are:


Disadvantages of real-world data
- DIS - Expensive -transporting the equipment and staff to the location 
- DIS - Requires specialized equipment
- DIS - Weather condition can affect the accuracy of the data


Advantage of video game data
- ADV - Cheap and easy to come by
- ADV - A larger amount of data can be collected
- ADV - Control of lighting and weather conditions 
- ADV - Accurate depth in poor weather conditions easily
- ADV - Can create edge cases but these don’t happen naturally
- ADV - No transportation and logistical costs -- Saves the environment
- ADV - Procedural generation of datasets


Disadvantages of video game data


- DIS - May not be an accurate estimation of the real world
- DIS - Overfitting to similar environments
- DIS - Our data may be too perfect, real-world data can contain noise and other artefacts


To analyse our results we will test them against real-world datasets, such as DrivingStereo, KITTI.
We intend to build upon existing papers that obtain depth data from video games.  Existing solutions exist but they are not open source and freely available.


## Depth detection and video games
Advancement of the realistic nature of Computer-Generated (CG) environments could become the basis for training artificial agents. Instead of spending time driving around physical cars in the real world to collect data, a CG environment could be created to simulate the same data. Advantages of CG environments over real-world environments could be that data is much faster to produce, AI’s could learn faster and edge cases (broken down vehicles, towing caravans etc…) could be easily inserted into the simulation at will instead of waiting for them to happen in real life. Transfer learning could be exploited to utilise the training performed in the CG environment to the real-world.

Existing techniques are beginning to be used for training self-driving cars this way. Nvidia have created a virtual training environment for training driverless cars for Toyota called Nvidia Constellation. This system allows for the training, testing and evaluation of driverless AI in many thousands of different scenarios allowing for billions of miles of training before hitting the road.

Edge case simulations can easily be inserted into the CG environment. Existing scenes from previous CG projects and games can be directly used and included in the training set. This will be advantageous for the AI’s learning capabilities as exposure to as many scenarios as possible generates experience for real-world events. Taken further, simulations of natural disasters, traffic jams, piles ups and crashes can all be generated in CG without any real-world danger! Fully autonomous vehicles should be equipped with the experience to tackle these edge cases, should they need to deal with them.

Modern video game engines are capable of creating incredibly complex scene topologies in real time with physically based movements and interactions. An example of this being Grand Theft Auto 5 in which players are free to roam in a realistic rendering of a city, San Andreas, based on Los Angeles. Players can drive a multitude of cars, motorcycles and aircraft around the city on numerous roads and paths. This environment could be an ideal playground for building and training autonomous vehicles for use in the real world. One example of this is in a youtube series by user ‘Sentdex’ who attempted to use a Deep Convolutional Network based on AlexNet to drive on the streets of San Andreas. The model is available on github for download.

In a video game, depth field training data can be extracted directly from the video game engine and is called a z-buffer. This been demonstrated by Adrian Courrèges.

Existing techniques are available for constructing a depth map from a single frame and from binocular stereo frames. The referenced single frame method employs traditional computer vision techniques whereas the stereo method uses a CNN architecture.


#  Aims and objectives
We aim to generate a dataset consisting of RGB and Depth channels in a wide range of environments and weather conditions.
We aim to create and evaluate depth estimation algorithms, utilising PyTorch, OpenCV and a combination of traditional computer vision methods, finally evaluating our results on a range of testing datasets.
The following measurable objectives have been identified:
- Generate a synthetic image and depth dataset from the video game GTA V
- Use data augmentation techniques to increase the amount of data
- Use a driving bot and mods to autonomously collect data in different weather conditions and environments
- Create a traditional (OpenCV) depth prediction algorithm
- Create a neural network only depth prediction algorithm
- Create a depth prediction algorithm neural network trained on data augmented with OpenCV techniques
- Evaluate the depth prediction model on various real-world and virtual testing datasets including the accepted standard KITTI dataset and a new, more varied dataset, DrivingStereo (Feb 2020).
- Compare our model to the model found in our references below, as they have already trained a model on a narrow video game dataset.


# Data acquisition and preparation

## GTA V

GTA V is a closed source video game meaning that there is no direct access to the source code available. As a result, the GTA V modding community has found multiple different ways of accessing parts of the rendering pipeline. These methods typically rely on injecting a 'DirectX11 driver' into the game before frame drawing time to intercept data used in the rendering stage. It is here that the depth information is stored.
Data collection was split into 3 parts at the beginning of the project and are defined below.

Simple collection
- Extract order 10 RGB and depth image pairs from GTAV
- Drive around in one environment with constant weather conditions, no occlusions and no data augmentation
- Data collected so that initial models have data to work with

Moderate collection
- Use a bot to automatically drive around and drastically increase the dataset size
- Use GTA mods to alter the weather conditions and times of day
- Drive around new locations such as city and off-road
- Implement low-level data augmentation such as translations and reflections

Hard collection
- Better data augmentation
- Add in varied occlusions and domain adaptation
- Image style transfer from real-world data to synthetic data


## Simple collection

A repository called [GTAVisionExport](https://github.com/umautobots/GTAVisionExport) was used as a starting point to extract single frames of Colour, Depth and Stencil Depth from GTAV. Saved files are stored as .raw images and, as such, cannot be directly imported into python natively. The section 'Image formats' discusses this further.

Extraction code within the GTAVisionExport was altered to the following to extract the depth, RGB/colour image and stencil (not used) images. If the 'L' key is pressed, the depth, stencil and color buffers are written to the game file directory.

![image cannot be found](Simple_collection_cpp.png "Extraction code")
Figure 2: Extraction code written in C++ to output RGB, Depth and stencil depth images from GTAV.


GTAVisionExport gives all saved images the same filenames (color.raw, depth.raw and stencil.raw). Taking multiple screenshots overwrites the currently saved files. Moving these files out of the game directory then taking another screenshot allows multiple different screenshots to be saved. This is cumbersome and annoying. As such, the GTAVisionExport source code will be changed so that new files are saved with a timestamp and to a new folder for easy, more organised storage.

A full description of how to install GTAVisionExport can be seen in the [group google document here.](https://docs.google.com/document/d/1UcQl8Q-COs9_vZ65RKXD8DnIcmRnXzqsJBdfcmd6iJY/edit?usp=sharing)


As ‘Simple collection’ only requires on the order of 10 colour and depth images, the manual moving and renaming files technique was be used.

Example image outputs from Simple collection can be seen below:

![image cannot be found](sample_simple_rgb.png "Simple RGB")
Figure 3. Example of a colour screenshot extracted from in-game.

![image cannot be found](sample_depth_rgb.png "Simple depth")
Figure 4. Depth information is shown with colour gradients, the more yellow/red the item the closer it is and vice versa for blue and purple.

### Image formats

Images extracted from GTAV are stored as .raw files and as such, two functions, `import_raw_colour_image` and `import_raw_depth_image` were written to load the images into numpy arrays.

The shape of colour images are: 	 (720, 1280, 4)

And for depth images they are: 	     (720, 1280)
 
Colour images are read in as 'unit8' with 4 channels, RGBA
This means that for every one of the 720*1280 pixels there are 4 numbers that represent the Red, Green, Blue and Alpha channels in the image.

Depth images are read in as 'float32' with 1 channel, depth
Depth values of Zero relate to infinite depth in the scene and the larger the number the closer to the camera an object is
The following conversion formula can be used to convert GTAV depth to real world metres:



## Data output from simple collection


The resulting data from Simple collection consists of 6 colour images of resolution 720x1280 and 6 associated depth maps. Images were taken from within the default GTA V car, on the roads around the starting house, all were taken in the same, daytime lighting conditions with clear weather.


## Moderate collection

Simple collection relied on manually pressing a keyboard key to capture an RGB image and depth map pair, part of moderate collection is to automate this process. To achieve this, and autoclicker software [AutoHotKey](https://www.autohotkey.com/) was used. A simple script to press the capture key automatically every 600ms was written, this is visible via Ben's [github repo](https://github.com/BenSnow6/depth_estimation/blob/master/Data_Collection/Moderate%20collection/testScript.ahk.ahk).

In addition to this, the extraction code was altered to allow for multiple images to be outputted without overwriting previous images. The process is outlined below:
- On ‘l’ key press
- Open notepad file with a number stored in it
- Attach number to ‘depth.raw’ and ‘colour.raw’ strings
- Save the outputs in folders, one for depth and one for colour
- Increment number stored in notepad.

The following C++ code was written to achieve this.

![image not found](Moderate_collection_cpp.png "Moderate collection extraction code")
Figure 5. Extraction code written in C++ altered from the GTAVisionExport tool to add numeric labels to image filenames during collection.



## VAutodrive
In order to truly collect data autonomously the automatic driving modification called [VAutodrive](https://www.gta5-mods.com/scripts/vautodrive) (Five Auto drive) was downloaded and utilised. It is simple to use: get in a car, change the view to 1st person (by pressing ‘V’), go to the map, set a waypoint (double click on road) then press ctrl+J to start the autopilot. By default the pilot will drive at 25mph, obeying traffic laws and driving non erratically. After starting VAutopilot use the autohotkey script and press Ctrl+L to start it, this will start automatic collection.


## NativeUI
To change the weather conditions in GTAV, a plugin called NativeUI was used allowing access to an in-game menu. Open this menu with ‘F4’ and use the number pad to navigate it. In NativeUI the weather conditions can be changed, along with the time of day. These were altered and automatic collection was used to collect scenes of around 8000 frames in different weather conditions and times of day.


## Output from Moderate collection

Moderate collection resulted in 7857 colour and 7857 associated depth images extracted from GTAV in 5 different weather conditions. All images are stored in .raw files with resolutions of 720x1280 and a size of 3.6MB each. The output is summarised in Table 1 below.

Conditions___| Number of colour images____| Number of depth images____| Size of data (GigaBytes) 
---|:---:|:---:|:---:
Sunny | 500 | 500 | 3.43
Snowy | 1000 | 1000 | 6.86
Foggy_dark | 1100| 1100 | 7.55
Blizzard | 3000 | 3000 | 20.5
Rain_night | 2257 | 2257 | 15.4
Total | 7857 | 7857 | 53.9 |
Table 1. Characteristics of the data collected in moderate collection.

## File structure
Files are stored in folders with subfolders for Colour and Depth. Each colour file is named colour_0000x.raw where x is the frame number. For example, the range of file names in the ‘Sunny’ conditions are colour_00001.raw to colour_00500.raw and depth_00001.raw to depth_00500.raw.
Data was collected in the order of Table 1, above. It follows that the first colour filenames in the Snowy collection are colour_00501.raw to colour_01500.raw and the same for depth.
Images are stored to separate colour and depth directories and for each instance of new conditions, a new directory folder is made and used (Sunny/colour, Snowy/depth etc…). A full list of driving conditions along with file labels can be found under the ‘Conditions.txt’ in the base of the /Moderate_collection directory on [One Drive](https://livebournemouthac-my.sharepoint.com/:f:/g/personal/bsnow_bournemouth_ac_uk/EmXxIrfiQg1IlJkc18oqVbwBEm-4461czkd9OvFbvjH1UA?e=5YOjrU).



# Installs and imports

In [4]:
!pip install pickle-mixin

Collecting pickle-mixin
  Downloading pickle-mixin-1.0.2.tar.gz (5.1 kB)
Building wheels for collected packages: pickle-mixin
  Building wheel for pickle-mixin (setup.py): started
  Building wheel for pickle-mixin (setup.py): finished with status 'done'
  Created wheel for pickle-mixin: filename=pickle_mixin-1.0.2-py3-none-any.whl size=6002 sha256=606dfa158d75027c849e08d00715d939094ca5e3cc814bccdd3aa0b99fe0412e
  Stored in directory: c:\users\ben\appdata\local\pip\cache\wheels\2a\a4\6c\83bfbc3b94f1bb43d634b07a6a893fd437a45c58b29aea5142
Successfully built pickle-mixin
Installing collected packages: pickle-mixin
Successfully installed pickle-mixin-1.0.2


In [3]:
import numpy as np
import matplotlib.pyplot as plt
from tqdm.notebook import trange, tqdm
import pickle
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

In [4]:
import sys
sys.path.append('../../')
from Evaluation_procedure.eval_functions import isValid, get_depth, calc_errors, predict_and_gt, mean_and_std_errors

In [5]:
import torch
import torchvision
import pandas as pd
import numpy as np
from PIL import Image
from Functions import import_raw_colour_image, import_raw_depth_image, show_depth_image, show_img
import os
from os import walk
from skimage import io, transform
import matplotlib.pyplot as plt
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, utils
import warnings
warnings.filterwarnings("ignore")
from pathlib import Path
from os import listdir
from os.path import isfile, join
import torch.nn.functional as F
from torchsummary import summary
import math

plt.ion()   # interactive mode

## Reading in the csv data structure

A csv containing the details of the Moderate Collection data was created, this file is read in to two variables, `folder_names` and `num_files`. These will be used to create a list of filenames from which will then be used to construct the dataset class.

In [7]:
import csv
with open('..\data_descriptions.csv', newline='') as csvfile: ###### data_descriptions csv must be in this relative location
    spamreader = csv.reader(csvfile, delimiter=',', quotechar='|')
    count = 0
    for row in spamreader:
        if count == 0:
            folder_names = row ## store names of folders in the directory
        else:
            num_files = row ## store number of files within folder
        count = 1

In [8]:
for i in range(0,len(num_files)):
    num_files[i] = int(num_files[i]) ## convert number of files from string to int

Creating a list of filenames for use in the dataset class.

In [9]:
list_of_numbers = ["{0:05}".format(i) for i in range(1, sum(num_files)+1)]
colour_filenames = []
depth_filenames = []
for num in list_of_numbers:
    colour_filenames.append(f"colour_{num}.raw")  ## append formatted number labels to file names
    depth_filenames.append(f"depth_{num}.raw")

## Dataset class

Due to having over 60GB of images in the Moderate Collection dataset it is impossible to simultaneously load all of them into the 16GB of available memory. As such a custom PyTorch dataset class will be created along with a data loader allowing for batches of n images to be loaded into memory at one time.

The dataset class, named ModerateDataset, loads the images from a fixed directory. For this to work on a given computer, the ‘Path’ must be changed to the base directory of the Moderate Dataset.



In [6]:
class ModerateDataset(Dataset):

    def __init__(self, col_dir='', depth_dir='', transform=None, trans_on=False):
        self.path_names = {}
        for folder in folder_names:
            self.path_names[f"{folder}"] = {}
        for folder in folder_names:
            self.path_names[f'{folder}']['colour'] = {}
            self.path_names[f'{folder}']['depth'] = {}
        for i in range(1, num_files[0]):
            self.path_names['Sunny']['colour'][f"{i}"] = {}
            self.path_names['Sunny']['depth'][f"{i}"] = {}
        print("*************MAKE SURE THE PATH FILE IN THE FOR LOOP IS THE BASE IMAGE DIRECTORY ON YOUR COMPUTER**************")
        count = 0
        for folder in folder_names:
            for i in range(0, num_files[folder_names.index(folder)]):
                self.path_names[f'{folder}']['colour'][f'{i+1}'] = Path(f"C:/Users/Ben/OneDrive - Bournemouth University/Computer Vision/Moderate collection/{folder}/colour/{colour_filenames[count+i]}")  ## Change this path here!!!!
                self.path_names[f'{folder}']['depth'][f'{i+1}'] = Path(f"C:/Users/Ben/OneDrive - Bournemouth University/Computer Vision/Moderate collection/{folder}/depth/{depth_filenames[count+i]}")   ## Change this path here!!!!
            count = count + num_files[folder_names.index(folder)]
        
        self.transform = transform
        self.col_dir = col_dir
        self.depth_dir = depth_dir
        self.trans_on = trans_on

    def __getitem__(self,idx):
        if idx == 0:
            
            self.col_dir = self.path_names[f'{folder_names[0]}']['colour'][f'{idx+1}']
            self.depth_dir = self.path_names[f'{folder_names[0]}']['depth'][f'{idx+1}']
        
        if (idx>0 and idx <= num_files[0]):  ## 1-500

            self.col_dir = self.path_names[f'{folder_names[0]}']['colour'][f'{idx}']
            self.depth_dir = self.path_names[f'{folder_names[0]}']['depth'][f'{idx}']

        elif (idx > num_files[0] and idx < (sum(num_files[:2])+1)): ## 501 - 1500

            self.col_dir = self.path_names[f'{folder_names[1]}']['colour'][f'{idx-num_files[0]}']
            self.depth_dir = self.path_names[f'{folder_names[1]}']['depth'][f'{idx-num_files[0]}']

        elif (idx > sum(num_files[:2]) and idx < (sum(num_files[:3])+1) ): ## 1501 - 2600

            self.col_dir = self.path_names[f'{folder_names[2]}']['colour'][f'{idx-sum(num_files[:2])}'] # -1500
            self.depth_dir = self.path_names[f'{folder_names[2]}']['depth'][f'{idx-sum(num_files[:2])}']

        elif (idx > sum(num_files[:3]) and idx < (sum(num_files[:4])+1) ): ## 2601 - 5600

            self.col_dir = self.path_names[f'{folder_names[3]}']['colour'][f'{idx-sum(num_files[:3])}'] #-2600
            self.depth_dir = self.path_names[f'{folder_names[3]}']['depth'][f'{idx-sum(num_files[:3])}']
            
        elif (idx > sum(num_files[:4]) and idx < (sum(num_files[:5])+1) ): ## 5601 - 7857

            self.col_dir = self.path_names[f'{folder_names[4]}']['colour'][f'{idx-sum(num_files[:4])}'] # -5600
            self.depth_dir = self.path_names[f'{folder_names[4]}']['depth'][f'{idx-sum(num_files[:4])}']

        elif (idx > sum(num_files)):
            raise NameError('Index outside of range')

        col_img = import_raw_colour_image(self.col_dir)
        depth_img = import_raw_depth_image(self.depth_dir)
        if self.trans_on == True:
            col_img = torch.from_numpy(np.flip(col_img,axis=0).copy()) # apply any transforms
            depth_img = torch.from_numpy(np.flip(depth_img,axis=0).copy()) # apply any transforms
            col_img = col_img.transpose(0,2)
            col_img = col_img.transpose(1,2)
        if self.transform: # if any transforms were given to initialiser
            col_img = self.transform(col_img) # apply any transforms
        return col_img, depth_img
    
    def __len__(self):
        return sum(num_files)



NameError: name 'Dataset' is not defined

Creating an instance of the dataset in order to create training, validation and testing datasets.

In [11]:
total_Data = ModerateDataset(trans_on=True)  ## instancing the dataset

*************MAKE SURE THE PATH FILE IN THE FOR LOOP IS THE BASE IMAGE DIRECTORY ON YOUR COMPUTER**************


## Train, validation and test splitting

It is of vital importance to establish the separation of three datasets: training, validation and testing. Training data is used to train the neural network model and validation data is used to check that the model is not overfitting to the training data. Testing data is used to check the performance of the trained model on unseen data to evaluate performance with a set of predefined metrics (defined in the evaluation procedure section).

A train, validation, testing split of 80/10/10 has been used to create three datasets: train_dataset, val_dataset and test_dataset. This split is commonly used in machine learning research. These datasets all inherit from the ModerateDasaset class. For each of these datasets, a data loader was created to load in a batch of images at once instead of loading the entire dataset to memory. To train the model, the training and validation dataloaders are used. This ensures that no testing data is used in any step of training the model.


In [12]:
train_size = int(0.8 * len(total_Data))
val_size = int((len(total_Data) - train_size)/2)
train_dataset, val_dataset, test_dataset = torch.utils.data.random_split(total_Data, [train_size, val_size, val_size])

In [13]:
batch_sz = 16
tr_dl  = DataLoader(train_dataset,  batch_size=batch_sz, shuffle=True,  num_workers=0)
val_dl = DataLoader(val_dataset,  batch_size=batch_sz, shuffle=True,  num_workers=0)
test_dl = DataLoader(test_dataset,  batch_size=batch_sz, shuffle=True,  num_workers=0)

# Simple CNN model

One of the key deliverables in the project proposal is to create a simple neural network architecture that uses an RGB image as an input outputs a depth image. This is realised below by the use of a convolutional neural network. The network, referred to as the 'Simple CNN' model, is defined below. It consists of two convolutional and two deconvolutional layers. A 3 channel colour image is input into the first conv layer, this increases the number of images channels to 6. After this, a rectified linear unit activation function is applied to the convolved data. The image is then passed through another conv layer increasing the channels to 12. This then leads to two deconv layers, outputting a singular channeled depth image with the same resolution as the imput image, this is ensured by the use of the kernal size, stride, and padding variables.

In [12]:
net = nn.Sequential(
    nn.Conv2d(in_channels=3,  out_channels=6, kernel_size=3, stride=1, padding=1), 
    nn.ReLU(),
    nn.Conv2d(in_channels=6, out_channels=12, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.ConvTranspose2d(in_channels = 12, out_channels=6, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.ConvTranspose2d(in_channels = 6, out_channels=1, kernel_size=3, stride=1, padding=1),
    nn.ReLU()
).cuda()


Using a model summary to show the structure of the network and setting the image dimensions to (3,720,1280) and the batch size to 16.

In [13]:
summary(net, (3,720,1280), 16)

## Training loop

Here, a fitting function is defined in which a neural network, training dataloader and validation dataloader are passed. The user can modify the loss function, number of training epochs, learning rate and weight decay used to train the network.

The fitting/training function used to train the model is defined by the following for the SimpleCNN:
- Loss function: Mean square error loss
- Epochs: 2
- Learning rate: 1x10-3
- Weight decay: 1x10-3
- Optimiser: Adam
- Training batch size = 16
- Validation batch size = 16
- Shuffling: Training=True, Validation=False
- Metrics that are tracked: Training loss, validation loss, validation accuracy


In [11]:
def fit(net, tr_dl, val_dl, loss=nn.MSELoss(), epochs=3, lr=3e-3, wd=1e-3):   

    Ltr_hist, Lval_hist = [], []    
    opt = optim.Adam(net.parameters(), lr=lr, weight_decay=wd)
    for epoch in trange(epochs):
        
        L = []
        dl = (iter(tr_dl))
        count_train = 0
        for xb, yb in tqdm(dl, leave=False):
            xb, yb = xb.float(), yb.float()
            xb, yb = xb.cuda(), yb.cuda()
            y_ = net(xb)
            l = loss(y_, yb)
            opt.zero_grad()
            l.backward()
            opt.step()
            L.append(l.detach().cpu().numpy())
            print(f"Training on batch {count_train} of {int(train_size/batch_sz)}")
            count_train+= 1

        # disable gradient calculations for validation
        for p in net.parameters(): p.requires_grad = False

        Lval, Aval = [], []
        val_it = iter(val_dl)
        val_count = 0
        for xb, yb in tqdm(val_it, leave=False):
            xb, yb = xb.float(), yb.float()
            xb, yb = xb.cuda(), yb.cuda()
            y_ = net(xb)
            l = loss(y_, yb)
            Lval.append(l.detach().cpu().numpy())
            Aval.append((y_.max(dim=1)[1] == yb).float().mean().cpu().numpy())
            print(f"Validating on batch {val_count} of {int(val_size/batch_sz)}")
            val_count+= 1

        # enable gradient calculations for next epoch 
        for p in net.parameters(): p.requires_grad = True 
            
        Ltr_hist.append(np.mean(L))
        Lval_hist.append(np.mean(Lval))
        print(f'training loss: {np.mean(L):0.4f}\tvalidation loss: {np.mean(Lval):0.4f}\tvalidation accuracy: {np.mean(Aval):0.2f}')
    return Ltr_hist, Lval_hist

## Training the network

For the SimpleCNN it was not imperative to train the model for a large number of epochs since the goal was to produce data that could be used to test the evaluation procedure.
Due to the large amount of training data, training on a Nvidia 2070 Max Q 8GB and an i7-8750H with 16GB ram takes around 30 minutes per epoch.

In [14]:
#Ltr_hist, Lval_hist = fit(net.cuda(), tr_dl, val_dl, epochs=2)

## Loading and saving the trained model

To save time re-training a model from scratch, the model is saved and can be reloaded without needing retraining.

In [13]:
path = 'model_20042020'

In [14]:
#torch.save(net, path)

In [15]:
re_load_trained_model = torch.load(path)

In [16]:
re_load_trained_model.eval()

Sequential(
  (0): Conv2d(3, 6, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): ReLU()
  (2): Conv2d(6, 12, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (3): ReLU()
  (4): ConvTranspose2d(12, 6, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (5): ReLU()
  (6): ConvTranspose2d(6, 1, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (7): ReLU()
)

# Evaluation


## Motivation for evaluation
After the models have been trained it is important to understand how well they perform at the task of predicting a depth map from an RGB image. To do this, a portion of the Moderate Collection data was set aside as a testing dataset. This test dataset contains 786 RGB/Depth image pairs that the models have not been exposed to. These image pairs, therefore, can be used to evaluate how well the trained models perform on unseen data from the same overall dataset.



Each RGB image will be passed through the models and a predicted depth map will be produced. This depth map will then be compared to the ground truth depth map associated with it. There are numerous ways in which to compare these depth maps. A simple method could be showing an observer both depth maps and the RGB image and asking them which they think is best and where in the image the predicted depth map lacks clarity. This process takes a large amount of time per depth map, on the order of a minute, and the results generated are qualitative and not quantitative. This can be done for a few images to get a small insight into the appearance of the predicted depth maps. To evaluate all 786 RGB/Depth maps, a quantitative approach is needed.



One way of comparing one depth map to another is to compare them pixel by pixel. The difference between predicted depth and ground truth can be found for each pixel and the total difference can be calculated by summing these individual differences. An average difference from one depth map to the other can then be calculated by dividing the total summed differences and dividing by the number of pixels in the depth map. This is known as the mean difference error. A pitfall of this error is that negative differences in depth can cancel out positive differences leading to an unreliable result. To negate this effect, we will take the absolute value of the difference for each comparison. This will ensure that the total difference calculated is positive and that the errors correctly compound. This is known as the MAE-Mean Absolute Error and is defined in the ‘Error metrics’ list under ‘Evaluation procedure’ below.


In [18]:
simple_predictions, simple_gts = predict_and_gt(val_dl, val_size, batch_sz, re_load_trained_model)

HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))



## Initialise a dictionary for holding the calculated errors

In [19]:
error_dictionary = {}
chunk_size = val_size//batch_sz
for i in range(chunk_size+1):
        error_dictionary[f"{i}"] = {}
for i in range(chunk_size+1):
    for j in range(batch_sz):
        error_dictionary[f"{i}"][f"{j}"] = {}
# error_dictionary

In [42]:
%%time
for i in range(chunk_size-1):
    for j in range(batch_sz-1): # batch_sz):
        error_dictionary[f"{i}"][f"{j}"] = calc_errors(simple_predictions[i][j], simple_gts[i][j])
    print(f"Calculating errors for batch {i} of {int(chunk_size)}")

Calculating errors for batch 0 of 49
Calculating errors for batch 1 of 49
Calculating errors for batch 2 of 49
Calculating errors for batch 3 of 49
Calculating errors for batch 4 of 49


KeyboardInterrupt: 

## Calculate average errors over test dataset

In [18]:
(numpy_depth_prediction[49][0])

NameError: name 'numpy_depth_prediction' is not defined

In [124]:
%%time
# initialisation of average errors
difference_err_avg = 0
sqr_diff_err_avg = 0
inv_err_avg = 0
inv_sqr_err_avg = 0
log_err_avg = 0
log_sqr_err_avg = 0
log_non_abs_err_avg = 0
abs_rel_err_avg = 0
sqr_rel_err_avg = 0

Wall time: 0 ns


In [125]:
%%time
for i in range(0, int(val_size/batch_sz)):
    for j in range(0, batch_sz-1):
        difference_err_avg += error_dictionary[f"{i}"][f"{j}"][0]
        sqr_diff_err_avg += error_dictionary[f"{i}"][f"{j}"][1]
        inv_err_avg += error_dictionary[f"{i}"][f"{j}"][2]
        inv_sqr_err_avg += error_dictionary[f"{i}"][f"{j}"][3]
        log_err_avg += error_dictionary[f"{i}"][f"{j}"][4]
        log_sqr_err_avg += error_dictionary[f"{i}"][f"{j}"][5]
        log_non_abs_err_avg += error_dictionary[f"{i}"][f"{j}"][6]
        abs_rel_err_avg += error_dictionary[f"{i}"][f"{j}"][7]
        sqr_rel_err_avg += error_dictionary[f"{i}"][f"{j}"][8]

Wall time: 2.44 ms


In [126]:
## divide by number of images to get average error
difference_err_avg /= (val_size)
sqr_diff_err_avg /= (val_size)
inv_err_avg /= (val_size)
inv_sqr_err_avg /= (val_size)
log_err_avg /= (val_size)
log_sqr_err_avg /= (val_size)
log_non_abs_err_avg /= (val_size)
abs_rel_err_avg /= (val_size)
sqr_rel_err_avg /= (val_size)
print(difference_err_avg, sqr_diff_err_avg, inv_err_avg, inv_sqr_err_avg, log_err_avg, log_sqr_err_avg, log_non_abs_err_avg, abs_rel_err_avg, sqr_rel_err_avg)

0.15349015485873604 0.20413706169302126 216.92079890689172 449.87569844827283 1.4395006203028675 1.6912679620310802 1.0779036028588291 17.408615003311972 57917.195704816244


## Calculate standard deviation in average errors

In [141]:
# initialise difference counters
difference_err_count = 0
sqr_diff_err_count = 0
inv_err_count = 0
inv_sqr_err_count = 0
log_err_count = 0
log_sqr_err_count = 0
log_non_abs_err_count = 0
abs_rel_err_count = 0
sqr_rel_err_count = 0

In [142]:
%%time
# sum squared differences
for i in range(0, int(val_size/batch_sz)):
    for j in range(0, batch_sz-1):
            difference_err_count += (error_dictionary[f"{i}"][f"{j}"][0] - difference_err_avg)**2
            sqr_diff_err_count += (error_dictionary[f"{i}"][f"{j}"][0] - sqr_diff_err_avg)**2
            inv_err_count += (error_dictionary[f"{i}"][f"{j}"][0] - inv_err_avg)**2
            inv_sqr_err_count += (error_dictionary[f"{i}"][f"{j}"][0] - inv_sqr_err_avg)**2
            log_err_count += (error_dictionary[f"{i}"][f"{j}"][0] - log_err_avg)**2
            log_sqr_err_count += (error_dictionary[f"{i}"][f"{j}"][0] - log_sqr_err_avg)**2
            log_non_abs_err_count += (error_dictionary[f"{i}"][f"{j}"][0] - log_non_abs_err_avg)**2
            abs_rel_err_count += (error_dictionary[f"{i}"][f"{j}"][0] - abs_rel_err_avg)**2
            sqr_rel_err_count += (error_dictionary[f"{i}"][f"{j}"][0] - sqr_rel_err_avg)**2

Wall time: 6.97 ms


In [144]:
# divide by number of test images
difference_err_count /= val_size
sqr_diff_err_count /= val_size
inv_err_count /= val_size
inv_sqr_err_count /= val_size
log_err_count /= val_size
log_sqr_err_count /= val_size
log_non_abs_err_count /= val_size
abs_rel_err_count /= val_size
sqr_rel_err_count /= val_size

In [146]:
# square root
difference_err_sigma = math.sqrt(difference_err_count)
sqr_diff_err_sigma = math.sqrt(sqr_diff_err_count)
inv_err_sigma = math.sqrt(inv_err_count)
inv_sqr_err_sigma = math.sqrt(inv_sqr_err_count)
log_err_sigma = math.sqrt(log_err_count)
log_sqr_err_sigma = math.sqrt(log_sqr_err_count)
log_non_abs_err_sigma = math.sqrt(log_non_abs_err_count)
abs_rel_err_sigma = math.sqrt(abs_rel_err_count)
sqr_rel_err_sigma = math.sqrt(sqr_rel_err_count)
print(difference_err_sigma, sqr_diff_err_sigma, inv_err_sigma, inv_sqr_err_sigma, log_err_sigma, log_sqr_err_sigma, log_non_abs_err_sigma, abs_rel_err_sigma, sqr_rel_err_sigma)

0.0031852591413042137 0.003451686762223687 7.476418605301046 15.51154946000541 0.04410363930439457 0.05276895537876926 0.031676132664665964 0.594808591100741 1997.6868841917874


In [148]:
mean_errors = [difference_err_avg, sqr_diff_err_avg, inv_err_avg, inv_sqr_err_avg, log_err_avg, log_sqr_err_avg, log_non_abs_err_avg, abs_rel_err_avg, sqr_rel_err_avg]
std_devs = [difference_err_sigma, sqr_diff_err_sigma, inv_err_sigma, inv_sqr_err_sigma, log_err_sigma, log_sqr_err_sigma, log_non_abs_err_sigma, abs_rel_err_sigma, sqr_rel_err_sigma]

In [160]:
for mean, std in zip(mean_errors, std_devs):
    print("Err = ""{:.3f}".format(mean), "\t +-\t""{:.3f}".format(std))

Err = 0.153 	 +-	0.003
Err = 0.204 	 +-	0.003
Err = 216.921 	 +-	7.476
Err = 449.876 	 +-	15.512
Err = 1.440 	 +-	0.044
Err = 1.691 	 +-	0.053
Err = 1.078 	 +-	0.032
Err = 17.409 	 +-	0.595
Err = 57917.196 	 +-	1997.687


In [21]:
val_preds, val_gts = predict_and_gt(val_dl, val_size, batch_sz, re_load_trained_model)

HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))



In [22]:
test_preds, test_gts = predict_and_gt(test_dl, val_size, batch_sz, re_load_trained_model)

HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))



In [15]:
# for re-loading saved variables
# f = open('test_gts.pckl', 'wb')
# pickle.dump(test_gts, f)
# f.close()
path_saved = 'C:/Users/Ben/OneDrive - Bournemouth University/Computer Vision/Datasets/Saved_preds/test_gts.pckl'
f = open(path_saved, 'rb')
test_gts = pickle.load(f)
f.close()

In [16]:
path_saved = 'C:/Users/Ben/OneDrive - Bournemouth University/Computer Vision/Datasets/Saved_preds/test_preds.pckl'
f = open(path_saved, 'rb')
test_preds = pickle.load(f)
f.close()

In [17]:
test_means, test_stds = mean_and_std_errors(test_preds, test_gts, val_size, batch_sz)

In [18]:
test_means

[0.15292886050578838,
 0.20256076692918262,
 169.7575138927575,
 337.4968641438327,
 1.3821823514353018,
 1.6306671019519026,
 1.0416930808944187,
 14.377977729056317,
 35337.66780938274]

In [19]:
test_stds

[0.08957746447381942,
 0.09489953776467451,
 162.3142804656832,
 322.856735180481,
 1.1664623689199738,
 1.4037154706957042,
 0.8418862470761609,
 13.60159222443305,
 33821.33976314626]

In [20]:
path_saved = 'C:/Users/Ben/OneDrive - Bournemouth University/Computer Vision/Datasets/Saved_preds/val_preds.pckl'
f = open(path_saved, 'rb')
val_preds = pickle.load(f)
f.close()

In [21]:
path_saved = 'C:/Users/Ben/OneDrive - Bournemouth University/Computer Vision/Datasets/Saved_preds/val_gts.pckl'
f = open(path_saved, 'rb')
val_gts = pickle.load(f)
f.close()

In [22]:
val_means, val_stds = mean_and_std_errors(val_preds, val_gts, val_size, batch_sz)

In [23]:
val_means

[0.1523476418379662,
 0.20036514733386704,
 206.6017020603569,
 403.54414631860675,
 1.4401937428260463,
 1.6913869644066872,
 1.0672831635531381,
 15.333102927912666,
 12584.9235118754]

In [24]:
val_stds

[0.09106520633516951,
 0.09579359667519721,
 197.57826652032824,
 386.07085383566994,
 1.222547923855846,
 1.4624174598570419,
 0.867006191162321,
 14.516336031463855,
 12044.806188680475]

In [28]:
path_saved = 'C:/Users/Ben/OneDrive - Bournemouth University/Computer Vision/Datasets/Saved_preds/val_stds.pckl'
f = open(path_saved, 'wb')
pickle.dump(val_stds, f)
f.close()

# Standardisation

train dataloader is called tr_dl

In [21]:
mean = 0.0
count_mn = 0
for images, _ in tr_dl:
    images = images.float()
    batch_samples = images.size(0) 
    images = images.view(batch_samples, images.size(1), -1)
    mean += images.mean(2).sum(0)
    print(f"Whirrrrrr calculating..... {count_mn} of {train_size/batch_sz}")
    count_mn+= 1
mean = mean / len(tr_dl.dataset)


Whirrrrrr calculating..... 0 of 392.8125
Whirrrrrr calculating..... 1 of 392.8125
Whirrrrrr calculating..... 2 of 392.8125
Whirrrrrr calculating..... 3 of 392.8125
Whirrrrrr calculating..... 4 of 392.8125
Whirrrrrr calculating..... 5 of 392.8125
Whirrrrrr calculating..... 6 of 392.8125
Whirrrrrr calculating..... 7 of 392.8125
Whirrrrrr calculating..... 8 of 392.8125
Whirrrrrr calculating..... 9 of 392.8125
Whirrrrrr calculating..... 10 of 392.8125
Whirrrrrr calculating..... 11 of 392.8125
Whirrrrrr calculating..... 12 of 392.8125
Whirrrrrr calculating..... 13 of 392.8125
Whirrrrrr calculating..... 14 of 392.8125
Whirrrrrr calculating..... 15 of 392.8125
Whirrrrrr calculating..... 16 of 392.8125
Whirrrrrr calculating..... 17 of 392.8125
Whirrrrrr calculating..... 18 of 392.8125
Whirrrrrr calculating..... 19 of 392.8125
Whirrrrrr calculating..... 20 of 392.8125
Whirrrrrr calculating..... 21 of 392.8125
Whirrrrrr calculating..... 22 of 392.8125
Whirrrrrr calculating..... 23 of 392.8125
Wh

NameError: name 'count_var' is not defined

In [22]:
mean

tensor([ 98.2207, 101.6702, 102.9898])

In [24]:

var = 0.0
count_var =0
for images, _ in tr_dl:
    batch_samples = images.size(0)
    images = images.view(batch_samples, images.size(1), -1)
    var += ((images - mean.unsqueeze(1))**2).sum([0,2])
    print(f"Whirrrrrr calculating..... {count_var} of {train_size/batch_sz}")
    count_var+= 1
std = torch.sqrt(var / (len(tr_dl.dataset)*720*1280))

Whirrrrrr calculating..... 0 of 392.8125
Whirrrrrr calculating..... 1 of 392.8125
Whirrrrrr calculating..... 2 of 392.8125
Whirrrrrr calculating..... 3 of 392.8125
Whirrrrrr calculating..... 4 of 392.8125
Whirrrrrr calculating..... 5 of 392.8125
Whirrrrrr calculating..... 6 of 392.8125
Whirrrrrr calculating..... 7 of 392.8125
Whirrrrrr calculating..... 8 of 392.8125
Whirrrrrr calculating..... 9 of 392.8125
Whirrrrrr calculating..... 10 of 392.8125
Whirrrrrr calculating..... 11 of 392.8125
Whirrrrrr calculating..... 12 of 392.8125
Whirrrrrr calculating..... 13 of 392.8125
Whirrrrrr calculating..... 14 of 392.8125
Whirrrrrr calculating..... 15 of 392.8125
Whirrrrrr calculating..... 16 of 392.8125
Whirrrrrr calculating..... 17 of 392.8125
Whirrrrrr calculating..... 18 of 392.8125
Whirrrrrr calculating..... 19 of 392.8125
Whirrrrrr calculating..... 20 of 392.8125
Whirrrrrr calculating..... 21 of 392.8125
Whirrrrrr calculating..... 22 of 392.8125
Whirrrrrr calculating..... 23 of 392.8125
Wh

In [25]:
std

tensor([63.4003, 64.1523, 64.4491])

In [29]:
Moderate_tr_stats = mean, std

mean is tensor([ 98.2207, 101.6702, 102.9898])
std is tensor([63.4003, 64.1523, 64.4491])

In [30]:
transforms.Normalize(*Moderate_tr_stats)

Normalize(mean=tensor([ 98.2207, 101.6702, 102.9898]), std=tensor([63.4003, 64.1523, 64.4491]))

In [45]:
normed_dataset = ModerateDataset(trans_on=True, transform=transforms.Normalize(*Moderate_tr_stats))

*************MAKE SURE THE PATH FILE IN THE FOR LOOP IS THE BASE IMAGE DIRECTORY ON YOUR COMPUTER**************


In [46]:
normed_dl = DataLoader(normed_dataset,  batch_size=batch_sz, shuffle=True,  num_workers=0)

In [48]:
next(iter(normed_dl))

[tensor([[[[2, 2, 2,  ..., 3, 3, 3],
           [2, 2, 2,  ..., 3, 3, 3],
           [2, 2, 2,  ..., 3, 3, 3],
           ...,
           [3, 3, 3,  ..., 3, 3, 3],
           [3, 3, 3,  ..., 3, 3, 3],
           [3, 3, 3,  ..., 3, 3, 3]],
 
          [[2, 2, 2,  ..., 2, 2, 2],
           [2, 2, 2,  ..., 2, 2, 2],
           [2, 2, 2,  ..., 2, 2, 2],
           ...,
           [2, 2, 2,  ..., 3, 3, 3],
           [2, 2, 2,  ..., 3, 3, 3],
           [2, 2, 2,  ..., 3, 3, 3]],
 
          [[2, 2, 2,  ..., 2, 2, 2],
           [2, 2, 2,  ..., 2, 2, 2],
           [2, 2, 2,  ..., 2, 2, 2],
           ...,
           [2, 2, 2,  ..., 3, 3, 3],
           [2, 2, 2,  ..., 3, 3, 3],
           [2, 2, 2,  ..., 3, 3, 3]]],
 
 
         [[[3, 3, 3,  ..., 0, 0, 0],
           [3, 3, 3,  ..., 0, 0, 0],
           [3, 3, 3,  ..., 0, 0, 0],
           ...,
           [2, 2, 2,  ..., 1, 1, 1],
           [2, 2, 2,  ..., 1, 1, 1],
           [2, 2, 2,  ..., 1, 1, 1]],
 
          [[3, 3, 3,  ..., 0, 0, 

# Kitti dataloader

In [103]:
root_dir = 'C:/Users/Ben/OneDrive - Bournemouth University/Computer Vision/Datasets/Kitti/depth_selection/val_selection_cropped/'

In [117]:
class Kitti(Dataset):
  def __init__(self, data_root, transform=torchvision.transforms.ToTensor(),):
    self.samples = {}
    self.transform = transform
    for file in os.listdir(data_root):
      subfolder_list = [ os.path.join( data_root, file , subfolder ) for subfolder in os.listdir(os.path.join(data_root, file))]
      self.samples[file] = subfolder_list

    keys = [ key for key in self.samples.keys()]
    self.RGB_DIRS   = self.samples[keys[0]]
    self.DEPTH_DIRS = self.samples[keys[1]]

    self.length = int(sum([len(self.samples[key]) for key in self.samples.keys()])*0.5)

  def __len__(self):
    return self.length

  def __getitem__(self, index):
    RGB_IMAGES   = Image.open(self.RGB_DIRS[index])
    DEPTH_IMAGES = Image.open(self.DEPTH_DIRS[index])

    if self.transform:
      RGB_IMAGES   = self.transform(RGB_IMAGES)
      DEPTH_IMAGES = self.transform(DEPTH_IMAGES)    
      
    return RGB_IMAGES, DEPTH_IMAGES

In [118]:
ki = Kitti(root_dir)

In [119]:
ki.__getitem__(1)

(tensor([[[0, 0, 0,  ..., 0, 0, 0],
          [0, 0, 0,  ..., 0, 0, 0],
          [0, 0, 0,  ..., 0, 0, 0],
          ...,
          [0, 0, 0,  ..., 0, 0, 0],
          [0, 0, 0,  ..., 0, 0, 0],
          [0, 0, 0,  ..., 0, 0, 0]]], dtype=torch.int32),
 tensor([[[1.0000, 1.0000, 1.0000,  ..., 0.0314, 0.0314, 0.0353],
          [1.0000, 1.0000, 1.0000,  ..., 0.0353, 0.0353, 0.0353],
          [1.0000, 1.0000, 1.0000,  ..., 0.0392, 0.0392, 0.0353],
          ...,
          [0.2627, 0.2627, 0.2588,  ..., 0.1686, 0.2196, 0.2863],
          [0.2549, 0.2627, 0.2706,  ..., 0.1294, 0.1451, 0.1725],
          [0.2588, 0.2627, 0.2667,  ..., 0.2314, 0.2235, 0.2275]],
 
         [[1.0000, 1.0000, 1.0000,  ..., 0.0431, 0.0471, 0.0471],
          [1.0000, 1.0000, 1.0000,  ..., 0.0392, 0.0353, 0.0392],
          [1.0000, 1.0000, 1.0000,  ..., 0.0431, 0.0392, 0.0353],
          ...,
          [0.2706, 0.2745, 0.2706,  ..., 0.2902, 0.3098, 0.2980],
          [0.2745, 0.2745, 0.2706,  ..., 0.2588, 0.270

In [120]:
total_Data = Kitti(root_dir)
rgb_image, depth_image = total_Data[100]
len(total_Data)

1000

In [121]:
fig = plt.figure()
ax1 = fig.add_subplot(1,2,1)
ax1.imshow(depth_image)
ax2 = fig.add_subplot(1,2,2)
ax2.imshow(rgb_image)

TypeError: Invalid shape (3, 352, 1216) for image data

In [122]:
kitti_dataloader = DataLoader(total_Data,  batch_size=16, shuffle=True,  num_workers=0)

In [129]:
next(iter(kitti_dataloader))[1][0].shape

torch.Size([3, 352, 1216])