# Create Basic Setup
### 1. Installation
* connect to own drive
* create paths used throughout the notebook
* get and install own repo
* get and install waymo repo

### 2. Data
* transfer waymo dataset from gcs to gdrive
* unpack 
* convert 

### 3. Training
* copy list of files into data dir if necessary
* training loop
* visual assessment

Remarks:
* `__linux_version` paths contain shell-escaped spaces (e.g. `My\ Drive`)
    * they should NOT be concatenated using e.g. `os.path.join`; use plain string concatenation instead
    * for usage with magic and shell commands, write them as `{path}`
* waymo: requires TensorFlow 1.x
* data transfer:
    * use a CPU runtime
    * Google Drive File Stream only allows a limited number of operations per time interval
* training: use a GPU runtime
* tensorboard: enable third-party cookies in your browser
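
The two path flavours from the remarks can also be kept apart as in the following sketch. It is an alternative to the hand-escaped strings used in this notebook, not what the notebook does: it assumes one plain path for Python APIs and derives a shell-safe variant with `shlex.quote`. The names `ROOT_DIR`, `DATA_DIR` and `shell_path` are illustrative and not part of the repo.

```python
import os
import shlex

# Plain path (no escape characters): what Python APIs such as os.listdir expect.
ROOT_DIR = '/content/drive/My Drive/Colab Notebooks/DeepCV_Packages'
DATA_DIR = os.path.join(ROOT_DIR, 'data')  # fine here, the path is unescaped


def shell_path(path):
    """Return `path` quoted for use in a shell command (hypothetical helper)."""
    return shlex.quote(path)


DATA_DIR__shell = shell_path(DATA_DIR)
print(DATA_DIR__shell)
# In a notebook cell, the quoted variant is what goes into the braces:
#   !ls {DATA_DIR__shell}
```

With this split, only the quoted variant is handed to `!` commands, while `os` and `pathlib` calls keep using the plain path.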

# 1. Installation

In [0]:
%tensorflow_version 1.x 

'''
MOUNT
'''

from google.colab import drive
drive.mount('/content/drive')

In [None]:
'''
Create PATHS
'''

ROOT_DIR__linux_version = r'/content/drive/My\ Drive/Colab\ Notebooks/DeepCV_Packages/'        # raw string: keeps the backslashes that escape the spaces for the shell
DATA_DIR__linux_version = ROOT_DIR__linux_version + 'data/'
REPO_DIR__linux_version = ROOT_DIR__linux_version + 'DMMFODS/'
DMMFODS_DIR__linux_version = REPO_DIR__linux_version + 'dmmfods/'

ARCHIVE_DEST_DIR__linux_version = r'/content/drive/My\ Drive/Colab\ Notebooks/'                 # this should be a directory containing very few files

newly_cloned = False

In [None]:
'''
GET OWN REPO
'''

%cd {ROOT_DIR__linux_version}
!rm -rf {REPO_DIR__linux_version}
!git clone https://github.com/pmcgrath249/DMMFODS.git

newly_cloned = True

In [0]:
'''
INSTALL EVERYTHING
'''

# permanently change dir 
%cd {DMMFODS_DIR__linux_version}

# install waymo dataset utils in utils; https://github.com/waymo-research/waymo-open-dataset/blob/master/tutorial/tutorial.ipynb
!cd utils && rm -rf waymo-od > /dev/null
!cd utils && git clone https://github.com/waymo-research/waymo-open-dataset.git waymo-od
!cd utils/waymo-od && git branch -a
!cd utils/waymo-od && git checkout remotes/origin/r1.0
!pip3 install --upgrade pip
!pip3 install waymo-open-dataset

# install requirements
!cd {REPO_DIR__linux_version} && pip3 install -r requirements.txt

# install own package
!cd {REPO_DIR__linux_version} && python3 -m pip install .
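
Since the Waymo utilities are used with TensorFlow 1.x here, it can be worth checking the version once the installs above have finished. A minimal sketch; `is_tf1` is a hypothetical helper and not part of the repo:

```python
def is_tf1(version):
    """Return True if a version string such as '1.15.2' denotes TensorFlow 1.x."""
    return version.split('.')[0] == '1'


# Demo on literal version strings, so the cell runs without TensorFlow installed.
for candidate in ('1.15.2', '2.4.0'):
    print(candidate, '->', is_tf1(candidate))

# In the notebook, the check would presumably look like this:
#   import tensorflow as tf
#   assert is_tf1(tf.__version__), tf.__version__
```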

# 2. Data

Data source: https://console.cloud.google.com/storage/browser/waymo_open_dataset_v_1_0_0

### Transfer

Help 1: https://medium.com/@philipplies/transferring-data-from-google-drive-to-google-cloud-storage-using-google-colab-96e088a8c041

I had to deviate from Help 1 because I was not able to find the project_id necessary for that approach.

Help 2: https://cloud.google.com/storage/docs/access-public-data?hl=de 

REMARK: No costs arise, as the bucket is managed by Waymo.

### Note

* Use a CPU runtime for this section: it gives access to more disk storage on the compute instance, and unpacking clutters the disk.
* No `os.path.join` for `__linux_version` paths: they cannot be concatenated using `os.path.join` because of the spaces and escape characters within the paths.
* Small number of files per directory: due to Colab/Drive issues, it is important to copy datasets to directories with little content. That is why the data is divided into otherwise unnecessary subdirectories; without them it is not possible to extract files from the archives reliably. Moreover, I have had issues with moving archives, hence the iterative procedure. See https://research.google.com/colaboratory/faq.html#drive-timeout
* Serialize in batches: loading data from Drive is the bottleneck at train time. I found that the disks cannot keep up with a high number of operations, i.e. loading batched data can speed up the loading procedure by 3-10x.
* Google Drive File Stream (Drive & Colab) only allows a limited number of operations: if too many operations are requested in a certain amount of time, you will not be able to run any notebooks for 24 hours. When that happens:
    * the Colab disk fills up
    * your notebook fails to save
    * although operations are being processed, nothing is written to Drive
* Empty the trash once in a while to avoid trouble when unpacking archives.
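
The idea behind serializing in batches can be sketched as follows. This is only an illustration of the principle (fewer, larger files mean fewer Drive operations), not the implementation of `utils.save_data_in_batch`; `save_in_batches`, `load_batch`, the pickle format and the file naming are assumptions made for the example.

```python
import pickle
import tempfile
from pathlib import Path


def save_in_batches(samples, out_dir, batch_size):
    """Write `samples` to `out_dir`, one pickle file per `batch_size` samples."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for start in range(0, len(samples), batch_size):
        path = out_dir / 'batch_{:05d}.pkl'.format(start // batch_size)
        with open(path, 'wb') as f:
            pickle.dump(samples[start:start + batch_size], f)
        paths.append(path)
    return paths


def load_batch(path):
    """Read one batch file back into memory with a single file operation."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# Toy data: 10 samples with 4 samples per file give 3 files instead of 10.
with tempfile.TemporaryDirectory() as tmp:
    samples = [{'id': i} for i in range(10)]
    batch_paths = save_in_batches(samples, tmp, batch_size=4)
    n_files = len(batch_paths)
    last_batch = load_batch(batch_paths[-1])

print(n_files, len(last_batch))
```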
  
 

In [None]:
'''
AUTHENTICATE GCS
'''

from google.colab import auth
auth.authenticate_user()

In [None]:
'''
LIST DIR TO BE COPIED
'''

bucket_name = 'waymo_open_dataset_v_1_0_0'
!gsutil ls -r gs://{bucket_name}/

In [None]:
'''
COPY
UNPACK
REDISTRIBUTE
CONVERT
BATCHING
REDISTRIBUTE
'''

import os
from dmmfods.utils import Dense_U_Net_lidar_helper as utils
from pathlib import Path

config = utils.get_config()

# naming
bucket_name = 'waymo_open_dataset_v_1_0_0'
training_bucket = os.path.join(bucket_name, 'training')

for i in range(4):                                                                                 # from ls above
    dataset_name = 'training_{:04d}.tar'.format(i)                                                  # zero-padded to four digits
    data_bucket = os.path.join(training_bucket, dataset_name)                 
    current_training_dir = os.path.join(config.dir.data.root, dataset_name[:-4])

    # copy
    print('start copying: ' + dataset_name)
    Bucket_Dest__linux_version = DATA_DIR__linux_version + dataset_name[:-4] + '/'
    !mkdir -p {Bucket_Dest__linux_version}
    !gsutil -m cp -r gs://{data_bucket}/ {Bucket_Dest__linux_version}                               # copy multi-threaded and recursively

    # unpack
    print('start unpacking: ' + dataset_name)
    archive_full_path = Bucket_Dest__linux_version + dataset_name
    !tar -xvf {archive_full_path} -C {Bucket_Dest__linux_version}

    print('deleting archive: ' + dataset_name)
    !rm {archive_full_path}

    filenames = os.listdir(current_training_dir)

    # redistribute
    for j, filename in enumerate(filenames):
        if not filename.endswith('tfrecord'):
            continue
        tf_data_dir = 'tf_' + str(j)
        Path(os.path.join(current_training_dir, tf_data_dir)).mkdir(exist_ok=True)                  # do not fail when the cell is re-run
        oldpath = os.path.join(current_training_dir, filename)
        newpath = os.path.join(current_training_dir, tf_data_dir, filename)
        os.rename(oldpath, newpath)

    # convert
    print('converting data of dir: ' + str(i))
    utils.waymo_to_pytorch_offline(data_root=current_training_dir, idx_dataset_batch=i)

    # batching and redistributing into train, val, test
    mode = 'train'                                                                                  # must be set before it is used in the print below
    print('batching data of dir: ' + str(i) + ' and putting it into ' + mode)
    config = utils.get_config()
    utils.save_data_in_batch(config, dataset_name, mode)
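
The redistribute step of the loop above can also be written as a standalone function, which makes it easier to try out on a scratch directory first. A sketch under assumptions: `redistribute_tfrecords` is a hypothetical helper, and unlike the loop above it numbers only the tfrecord files, so the `tf_<j>` indices are contiguous.

```python
import tempfile
from pathlib import Path


def redistribute_tfrecords(data_dir):
    """Move every *tfrecord file in `data_dir` into its own tf_<j> subdirectory."""
    data_dir = Path(data_dir)
    records = sorted(
        p for p in data_dir.iterdir()
        if p.is_file() and p.name.endswith('tfrecord')
    )
    moved = []
    for j, record in enumerate(records):
        target_dir = data_dir / 'tf_{}'.format(j)
        target_dir.mkdir(exist_ok=True)   # safe to re-run
        target = target_dir / record.name
        record.rename(target)
        moved.append(target)
    return moved


# Try it on a temporary directory with two dummy records and one other file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ('a.tfrecord', 'b.tfrecord', 'notes.txt'):
        (Path(tmp) / name).touch()
    moved = redistribute_tfrecords(tmp)
    moved_names = [p.relative_to(tmp).as_posix() for p in moved]
    untouched = (Path(tmp) / 'notes.txt').exists()

print(moved_names, untouched)
```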

# 3. Training

In [None]:
'''
TRAINING WITH TENSORBOARD VISUALIZATION
'''

# import
%cd {REPO_DIR__linux_version}
from dmmfods.utils.Dense_U_Net_lidar_helper import get_config
from dmmfods.agents.Dense_U_Net_lidar_Agent import Dense_U_Net_lidar_Agent as Dense_U_Agent
import os
from pathlib import Path

config = get_config()
config.agent.max_epoch = 20

# agent takes care of everything incl. tensorboard dirs
agent = Dense_U_Agent(config=config, torchvision_init=True)

# use tensorboard to visualize
%load_ext tensorboard
%tensorboard --logdir {os.path.join(*config.dir.current_run.summary.split('/')[-2:])}

# if newly installed
if newly_cloned:
    Path(config.dir.data.file_lists).mkdir(exist_ok=True)
    !cp {ROOT_DIR__linux_version + config.dataset.file_list_name} {DMMFODS_DIR__linux_version + 'data/' + config.dataset.file_list_name} 

# start training
agent.run()
agent.finalize()
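
The `--logdir` expression in the cell above keeps only the last two components of the summary path, so TensorBoard receives a path relative to the current working directory. A small sketch of that expression; the example path is hypothetical (the real value comes from `config.dir.current_run.summary`), and `relative_logdir` is an illustrative helper that, unlike the one-liner above, also ignores a trailing slash.

```python
import os


def relative_logdir(summary_dir, keep=2):
    """Return the last `keep` components of a '/'-separated path."""
    parts = [p for p in summary_dir.split('/') if p]
    return os.path.join(*parts[-keep:])


example_summary_dir = '/content/repo/experiments/run_01/summaries'  # hypothetical
logdir = relative_logdir(example_summary_dir)
print(logdir)
```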

In [None]:
'''
VISUALLY ASSESS DATA AFTER FORWARD PASS
'''

import numpy as np
import matplotlib.pyplot as plt
from dmmfods.utils.Dense_U_Net_lidar_helper import get_config
from dmmfods.agents.Dense_U_Net_lidar_Agent import Dense_U_Net_lidar_Agent as Dense_U_Agent

def visual_assessment(img, lidar, pred, gt):

    num_plots = gt.shape[0]
    fig=plt.figure(figsize=(4*7,num_plots*7))
    for i in range(num_plots):
        
        # rgb image
        im = img[i].permute(1, 2, 0).detach().numpy().astype(np.uint8)
        fig.add_subplot(num_plots, 4, i*4+1)   
        plt.imshow(im)

        # lidar image
        l = lidar[i].permute(1, 2, 0)[:,:,0].detach().numpy().astype(np.uint8)
        fig.add_subplot(num_plots, 4, i*4+2)   
        plt.imshow(l, cmap=plt.cm.gray)

        # network output
        p = pred[i].permute(1, 2, 0)[:,:,0].detach().numpy().astype(np.uint8)
        fig.add_subplot(num_plots, 4, i*4+3)   
        plt.imshow(p, cmap=plt.cm.gray)

        # ground truth
        g = gt[i].permute(1, 2, 0)[:,:,0].detach().numpy()
        fig.add_subplot(num_plots, 4, i*4+4)   
        plt.imshow(g, cmap=plt.cm.gray)

    plt.show()

config = get_config()
config.optimizer.mode = 'train'
agent = Dense_U_Agent(config=config, torchvision_init=True)   # pass the modified config, as in the training cell

# visualize one batch
for image, lidar, ht_map in agent.data_loader.train_loader:
    if agent.cuda:
        image = image.cuda()
        lidar = lidar.cuda()
    prediction = agent.model(image, lidar)
    visual_assessment(image.cpu(), lidar.cpu(), prediction.cpu(), ht_map)
    break
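
Two details of `visual_assessment` are easy to get wrong: the channel reordering before `imshow` and the 1-based subplot index. The sketch below reproduces both with NumPy only, using `np.transpose` as the counterpart of the `permute(1, 2, 0)` call on tensors; `to_hwc_uint8` and `subplot_index` are hypothetical helpers and the image is dummy data.

```python
import numpy as np


def to_hwc_uint8(chw):
    """Convert a (C, H, W) array to (H, W, C) uint8, the layout imshow expects."""
    return np.transpose(chw, (1, 2, 0)).astype(np.uint8)


def subplot_index(row, col, n_cols=4):
    """1-based index for matplotlib's add_subplot; `row` and `col` are 0-based."""
    return row * n_cols + col + 1


image = np.zeros((3, 48, 64), dtype=np.float32)  # dummy RGB image, C x H x W
hwc = to_hwc_uint8(image)
print(hwc.shape, subplot_index(1, 2))
```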