# MODEL TRAINING

This notebook trains models under different learning configurations. Outputs (model checkpoints, configuration dictionary) are saved to the directory specified below. The data path should point to the output folder created during preprocessing.


In [1]:
import torch
import torchvision
from torch.backends import cudnn 
from torchvision import transforms
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
cudnn.benchmark = True # might speed up runtime

import os
import shutil
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm

import models
import losses
import datasets
from helpers import io, trainer, run_manager

In [2]:
EXP_ID = "trees_points_full_final"

DATA_PATH = "/home/jovyan/work/processed/256x256"
SAVE_PATH = "/home/jovyan/work/runs"

EM = run_manager.Manager(EXP_ID, SAVE_PATH)
TB = SummaryWriter(EM.save_path)


## Data

The dataset class takes a list of image names as its _images_ parameter. The lists for the training and validation sets are built in the next cell from the .txt files that preprocessing writes to the _image_sets_ folder.

Preprocessing writes image names to files based on which label information is available for each image. The denmark_points dataset returns only points and the denmark_shapes dataset returns only shapes: the first should be used for training, the second for testing and for training the U-Net. The denmark_all dataset can return both, together with labels indicating which information is available. It is currently used for validation to calculate mIoU and loss. Use it with care: its output is NOT predictable, and data may be passed incorrectly if the returned labels are left unchecked.
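
Since validation reports mIoU, a small worked example may help when reading the numbers further down. This is a minimal sketch of one common definition (per-class intersection over union, averaged over the classes that occur), written in plain Python on flattened masks. The function name `mean_iou` is hypothetical, and the computation inside `trainer.valModel` may differ.

```python
def mean_iou(pred, target, n_classes=2):
    """Average per-class IoU over classes that occur in pred or target.

    pred, target: flat sequences of integer class ids of equal length.
    NOTE: illustrative definition only, not taken from helpers.trainer.
    """
    ious = []
    for cls in range(n_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:  # skip classes absent from both masks
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


# 0 = background, 1 = object (matches n_classes = 2 in this notebook)
pred = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]
score = mean_iou(pred, target)
print(score)
```

In this toy case each class has an intersection of 2 pixels and a union of 4 pixels, so both per-class IoUs are 0.5.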

Rules for image lists, regarding the specific models:

Training:
- LCFCN: use denmark_points, list must contain only tiles with points
- COB-LCFCN: use denmark_cob, list must contain only tiles with points
- UNet: use denmark_shapes, list must contain only tiles with shapes
- Mixed: use denmark_all, list must only contain tiles with shapes or points
- Stacked: use denmark_stacked, no rules for list

Validation:
- LCFCN, COB-LCFCN, Mixed: use denmark_all, list must only contain tiles with shapes or points
- UNet, Stacked: can stay on denmark_all, should be no problem - change to denmark_shapes for debugging purposes
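
The training rules above can be checked mechanically before a long run. The following is a rough sketch that assumes the tile names with points and with shapes are available as two sets, as read from points.txt and shapes.txt. The helper name `invalid_tiles` and the dictionary are hypothetical and not part of the `datasets` module. The dataset names follow the list above; the comment in the settings cell spells the COB variant `denmark_points_cob`, so that key may need adjusting.

```python
# Dataset name per model for training, copied from the rules above.
TRAIN_DATASET = {
    "LCFCN": "denmark_points",
    "COB-LCFCN": "denmark_cob",  # assumption: may be "denmark_points_cob" in code
    "UNet": "denmark_shapes",
    "Mixed": "denmark_all",
    "Stacked": "denmark_stacked",
}


def invalid_tiles(model, images, point_tiles, shape_tiles):
    """Return the tiles in `images` that break the training rule for `model`."""
    if model == "Stacked":  # no rules for the stacked list
        return []
    allowed = {
        "LCFCN": point_tiles,
        "COB-LCFCN": point_tiles,
        "UNet": shape_tiles,
        "Mixed": point_tiles | shape_tiles,
    }[model]
    return [name for name in images if name not in allowed]


point_tiles = {"tile_a", "tile_b"}
shape_tiles = {"tile_b", "tile_c"}
print(invalid_tiles("LCFCN", ["tile_a", "tile_c"], point_tiles, shape_tiles))
```

An empty result means the list satisfies the rule for that model.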

In [3]:
# basic settings for dataset
EM.object_type = "trees" # trees, buildings
EM.dataset_type = "denmark_points" # denmark_points, denmark_points_cob, denmark_shapes, denmark_all, denmark_stacked
EM.n_classes = 2 # two classes: 0 = background, 1 = object
EM.batch_size_train = 1
EM.batch_size_val = 1

# load image-lists from files
images_path_points = os.path.join(DATA_PATH, 'image_sets_'+EM.object_type, 'points.txt')
images_list_points = [name.replace("\n","") for name in io.readText(images_path_points)]
images_path_shapes = os.path.join(DATA_PATH, 'image_sets_'+EM.object_type, 'shapes.txt')
images_list_shapes = [name.replace("\n","") for name in io.readText(images_path_shapes)]
images_list_points_filtered = list(set(images_list_points) - set(images_list_shapes))

train_size = 1750 #round(len(images_list_points) * 0.8) #18985 
train_images = images_list_points_filtered  #images_list_shapes[:train_size]  
val_size =  round(len(images_list_shapes) * 0.5) #round(len(images_list_points) * 0.1)  #1108
val_images = images_list_shapes[:val_size]   #images_list_shapes[train_size:(train_size + val_size)]  

# create transformation object
transform_mean = [0.492, 0.475, 0.430] # from preprocessing
transform_std = [0.176, 0.173, 0.176]

EM.transform = transforms.Compose([transforms.ToTensor(),
                                   transforms.Normalize(mean = transform_mean, 
                                                        std = transform_std)])

print(f"Dataset sizes: \n - train: {len(train_images)} \n - val: {len(val_images)}")

Dataset sizes: 
 - train: 18985 
 - val: 1108
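
The cell above keeps training and validation tiles disjoint by removing every tile that has shapes from the point list. One detail worth noting: iterating over a Python `set` of strings does not guarantee a stable order between interpreter sessions, so any later slicing of that list could select different tiles from run to run. A possible variant, sketched here with a hypothetical helper `disjoint_split` and toy tile names, sorts the result to make it reproducible.

```python
def disjoint_split(point_names, shape_names, val_fraction=0.5):
    """Train on point-only tiles, validate on the first share of shape tiles."""
    shape_set = set(shape_names)
    # sorted() gives a reproducible order regardless of set iteration order
    train = sorted(name for name in set(point_names) if name not in shape_set)
    val_size = round(len(shape_names) * val_fraction)
    val = list(shape_names[:val_size])
    return train, val


points = ["t3", "t1", "t2", "t4"]   # toy names, not real tiles
shapes = ["t2", "t5", "t6", "t7"]
train_names, val_names = disjoint_split(points, shapes)
print(train_names, val_names)
```

Because every shape tile is excluded from training, the validation slice cannot overlap with the training list.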


In [4]:
train_set = datasets.getDataset(name = EM.dataset_type,
                                path = DATA_PATH,
                                images = train_images,
                                object_type = EM.object_type,
                                n_classes = EM.n_classes,
                                transform = EM.transform)

train_sampler = torch.utils.data.RandomSampler(train_set)

train_loader = DataLoader(train_set, sampler = train_sampler,
                          batch_size = EM.batch_size_train, 
                          drop_last = True, num_workers = 2, pin_memory = True)

val_set = datasets.getDataset(name = "denmark_all",
                              path = DATA_PATH,
                              images = val_images,
                              object_type = EM.object_type,
                              n_classes = EM.n_classes,
                              transform = EM.transform)

val_sampler = torch.utils.data.SequentialSampler(val_set)

val_loader = DataLoader(val_set, sampler = val_sampler,
                        batch_size = EM.batch_size_val,
                        num_workers = 2, pin_memory = True)

print("Dataloaders ready...")

Dataloaders ready...


## Model

The model can be selected from _vgg16_, _lcfcn_ and _unet_. Available loss functions for point supervision are _point_ and _point\_cob_. 

Rules for model configuration:
- LCFCN: use point type, lcfcn, point loss
- COB-LCFCN: use point_cob type, lcfcn, point_cob loss
- UNet: use supervised type, unet, dice loss
- Mixed: use mixed type, lcfcn, point_cob loss
- Stacked: use stacked type, unet, dice loss
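
The configuration rules above map each run type to one network and one loss, so a mismatch can be caught before any training starts. This is a minimal sketch. The table is copied from the list above, while the function name `check_config` is hypothetical and not part of `models` or `losses`.

```python
# run type -> (net name, loss name), copied from the rules above
MODEL_RULES = {
    "point": ("lcfcn", "point"),
    "point_cob": ("lcfcn", "point_cob"),
    "supervised": ("unet", "dice"),
    "mixed": ("lcfcn", "point_cob"),
    "stacked": ("unet", "dice"),
}


def check_config(run_type, net_name, loss_name):
    """Raise ValueError if net/loss do not match the rule for run_type."""
    expected = MODEL_RULES[run_type]
    if (net_name, loss_name) != expected:
        raise ValueError(
            f"type '{run_type}' expects net/loss {expected}, "
            f"got {(net_name, loss_name)}"
        )
    return True


config_ok = check_config("point", "lcfcn", "point")
print(config_ok)
```

In this notebook the call would be `check_config(EM.type, EM.net_name, EM.loss_name)` right after the settings cell.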

In [5]:
# basic settings for model
EM.type = 'point' # point, point_cob, supervised, mixed, stacked
EM.net_name = 'lcfcn' # lcfcn for point & point_cob, unet for supervised (resnet available)
EM.loss_name = 'point' #  point, point_cob, dice (custom) and BCELoss, CrossEntropy (stock)
EM.opt_name = 'adam'

# optimizer-specific settings
EM.adam_learning_rate = 1e-5
EM.adam_betas = (0.99, 0.999)
EM.adam_decay = 0.0005 

In [6]:
model = models.getNet(EM.net_name, EM.n_classes).cuda()

criterion = losses.getLoss(EM.loss_name)

optimizer = torch.optim.Adam(model.parameters(), lr = EM.adam_learning_rate, betas = EM.adam_betas, weight_decay = EM.adam_decay)

print("Model ready...")

Model ready...
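
To give a feel for the optimizer settings above, here is a scalar sketch of a single Adam update as `torch.optim.Adam` documents it (without AMSGrad), where `weight_decay` is added to the gradient as an L2 term. The function `adam_step` is written for illustration only. The starting parameter value and gradient are made-up numbers, and `eps = 1e-8` is the PyTorch default rather than a value set in this notebook.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-5, betas=(0.99, 0.999),
              weight_decay=0.0005, eps=1e-8):
    """One Adam update for a single scalar parameter at step t (t >= 1)."""
    beta1, beta2 = betas
    g = grad + weight_decay * theta          # L2 penalty folded into the gradient
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# made-up starting point: parameter 1.0, gradient 0.2, empty moment buffers
theta, m, v = adam_step(theta=1.0, grad=0.2, m=0.0, v=0.0, t=1)
print(1.0 - theta)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly the learning rate, whatever the gradient magnitude.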


## Run Management

Check whether a previous run with the same ID exists, then either load its last state dicts or rename the existing run folder to keep it as a backup.

In [7]:
EM.begin()

if os.path.exists(os.path.join(EM.save_path, 'checkpoint_last.pth')):
    confirm = input("Saved run with same ID found - load (l), rename (r) or cancel (c)?: ")
    
    if confirm == 'load' or confirm == 'l':
        # take epoch settings from manager
        EM = io.loadPKL(os.path.join(EM.save_path, 'manager.pkl'))
        # load state dicts
        checkpoint = torch.load(os.path.join(EM.save_path, 'checkpoint_last.pth'))
        model.load_state_dict(checkpoint['model'])
        optimizer.load_state_dict(checkpoint['optimizer'])
        print(f"Loaded previous run - continuing from epoch {EM.current_epoch}...")
   
    elif confirm == 'rename' or confirm == 'r':
        # rename existing experiment
        TB.close()
        os.rename(EM.save_path, os.path.join(SAVE_PATH, EM.id+"_"+str(np.random.randint(100, 999))))
        TB = SummaryWriter(EM.save_path)
        print(f"Starting new run from epoch 0...")
    
    else:
        print("No action taken...")
else:
    print(f"Starting new run from epoch 0...")

Starting new run from epoch 0...
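
The rename branch above appends a random three-digit number, which can in principle collide with an earlier backup. One possible alternative, sketched below under the assumption that the run folder is `os.path.join(SAVE_PATH, EM.id)`, uses a timestamp plus a counter. The helper `backup_run_dir` is hypothetical and not part of `run_manager`, and the demo runs in a temporary directory.

```python
import os
import tempfile
import time


def backup_run_dir(save_root, run_id):
    """Rename save_root/run_id to a timestamped backup and recreate it empty."""
    src = os.path.join(save_root, run_id)
    if not os.path.isdir(src):
        return None  # nothing to back up
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dst = os.path.join(save_root, f"{run_id}_{stamp}")
    counter = 1
    while os.path.exists(dst):  # same-second collision: add a counter
        dst = os.path.join(save_root, f"{run_id}_{stamp}_{counter}")
        counter += 1
    os.rename(src, dst)
    os.makedirs(src)
    return dst


with tempfile.TemporaryDirectory() as root:
    run_dir = os.path.join(root, "demo_run")
    os.makedirs(run_dir)
    open(os.path.join(run_dir, "checkpoint_last.pth"), "w").close()
    moved_to = backup_run_dir(root, "demo_run")
    moved_ok = os.path.exists(os.path.join(moved_to, "checkpoint_last.pth"))
    fresh_ok = os.listdir(run_dir) == []
    print(moved_ok, fresh_ok)
```

As in the cell above, any open `SummaryWriter` should be closed before the folder is renamed and reopened afterwards.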


## Main Epoch Loop

Each epoch consists of training, validation, updating the statistics, and saving both the best and the most recent model along with the validation statistics.
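
The "best model" bookkeeping can be illustrated in isolation. The sketch below assumes that the best checkpoint is chosen by the highest validation mIoU, which is what the loop in the next cell does through `EM.best_loss`. The helper `update_best` is hypothetical, and the five mIoU values are the ones logged for epochs 22 to 26 in this notebook's output, treated here as if they were a complete run.

```python
def update_best(best_miou, val_miou):
    """Return (new_best, is_new_best) for a metric where higher is better."""
    is_new_best = best_miou is None or val_miou > best_miou
    return (val_miou if is_new_best else best_miou), is_new_best


# val_mIoU values as logged for epochs 22..26
logged = [0.097451, 0.116078, 0.109498, 0.130713, 0.118019]

best = None
best_epochs = []
for epoch, miou in zip(range(22, 27), logged):
    best, is_new_best = update_best(best, miou)
    if is_new_best:
        best_epochs.append(epoch)  # this is where checkpoint_best.pth would be written

print(best, best_epochs)
```

Starting from `None` instead of testing `epoch == 0` is a design choice of this sketch: it also covers runs whose first epoch index is not 0.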

In [None]:
EM.epochs = 50
start_epoch = EM.current_epoch+1

for epoch in tqdm(range(start_epoch, EM.epochs)):
    
    # Training Phase
    train_loss = trainer.trainModel(model, optimizer, train_loader, criterion, EM.type)
    TB.add_scalar('training loss', train_loss, epoch)
    print(f"Training done with loss: {train_loss}")
    
    # Validation Phase
    val_loss_dict = trainer.valModel(model, val_loader, criterion, EM.type)
    val_loss = val_loss_dict["loss"]
    val_mIoU = val_loss_dict["mIoU"]
    TB.add_scalar('validation loss', val_loss, epoch)
    TB.add_scalar('validation mIoU', val_mIoU, epoch)
    print(f"Validation done with loss: {val_loss} and mIoU: {val_mIoU}")
    
    # update experiment manager with losses
    loss_dict = {'epoch': epoch+1, 'train': train_loss, 'val_loss': val_loss, 'val_mIoU': val_mIoU}
    EM.loss_list += [loss_dict]
    EM.current_epoch = epoch
    print("\n", pd.DataFrame(EM.loss_list).tail(), "\n")
    
    # save model and optimizer state as the most recent checkpoint
    checkpoint = {'epoch': epoch+1, 'model': model.state_dict(), 'optimizer': optimizer.state_dict()}
    torch.save(checkpoint, os.path.join(EM.save_path, 'checkpoint_last.pth'))
    
    # check if new best model (EM.best_loss stores the best validation mIoU, higher is better)
    if epoch == 0 or val_mIoU > EM.best_loss:
        torch.save(checkpoint, os.path.join(EM.save_path, 'checkpoint_best.pth'))
        EM.best_loss = val_mIoU
        print("New best...")
    
    # save the manager after the best-score update so a resumed run sees the current best
    io.savePKL(os.path.join(EM.save_path, 'manager.pkl'), EM)
    print("Checkpoint saved... ")

print(f"Run completed!")
TB.close()

Training done with loss: 2.4821531830987316


Validation done with loss: 4.582385028870257 and mIoU: 0.11801935294753986

     epoch     train  val_loss  val_mIoU
21     22  2.589658  3.617469  0.097451
22     23  2.559167  6.294258  0.116078
23     24  2.536203  3.726577  0.109498
24     25  2.502527  5.075934  0.130713
25     26  2.482153  4.582385  0.118019 

Checkpoint saved... 


Training done with loss: 2.460009833988841


Validation done with loss: 6.396224819321969 and mIoU: 0.10864152822269639

     epoch     train  val_loss  val_mIoU
22     23  2.559167  6.294258  0.116078
23     24  2.536203  3.726577  0.109498
24     25  2.502527  5.075934  0.130713
25     26  2.482153  4.582385  0.118019
26     27  2.460010  6.396225  0.108642 

Checkpoint saved... 


Training done with loss: 2.4258380915653674


Validation done with loss: 8.45365916812958 and mIoU: 0.11212033486684167

     epoch     train  val_loss  val_mIoU
23     24  2.536203  3.726577  0.109498
24     25  2.502527  5.075934  0.130713
25     26  2.482153  4.582385  0.118019
26     27  2.460010  6.396225  0.108642
27     28  2.425838  8.453659  0.112120 

Checkpoint saved... 


Training done with loss: 2.400635153885441


Validation done with loss: 4.078821048014693 and mIoU: 0.1106794790588442

     epoch     train  val_loss  val_mIoU
24     25  2.502527  5.075934  0.130713
25     26  2.482153  4.582385  0.118019
26     27  2.460010  6.396225  0.108642
27     28  2.425838  8.453659  0.112120
28     29  2.400635  4.078821  0.110679 

Checkpoint saved... 


Training done with loss: 2.378751511238583


Validation done with loss: 4.3970421464606 and mIoU: 0.09824882698919413

     epoch     train  val_loss  val_mIoU
25     26  2.482153  4.582385  0.118019
26     27  2.460010  6.396225  0.108642
27     28  2.425838  8.453659  0.112120
28     29  2.400635  4.078821  0.110679
29     30  2.378752  4.397042  0.098249 

Checkpoint saved... 


Training done with loss: 2.372263618551824


Validation done with loss: 3.964986970621295 and mIoU: 0.08878723837114802

     epoch     train  val_loss  val_mIoU
26     27  2.460010  6.396225  0.108642
27     28  2.425838  8.453659  0.112120
28     29  2.400635  4.078821  0.110679
29     30  2.378752  4.397042  0.098249
30     31  2.372264  3.964987  0.088787 

Checkpoint saved... 


Training done with loss: 2.346562250175382


Validation done with loss: 5.690737417906781 and mIoU: 0.106368253307116

     epoch     train  val_loss  val_mIoU
27     28  2.425838  8.453659  0.112120
28     29  2.400635  4.078821  0.110679
29     30  2.378752  4.397042  0.098249
30     31  2.372264  3.964987  0.088787
31     32  2.346562  5.690737  0.106368 

Checkpoint saved... 


Training done with loss: 2.3189613471235524


Validation done with loss: 7.646730501280496 and mIoU: 0.11013616599604936

     epoch     train  val_loss  val_mIoU
28     29  2.400635  4.078821  0.110679
29     30  2.378752  4.397042  0.098249
30     31  2.372264  3.964987  0.088787
31     32  2.346562  5.690737  0.106368
32     33  2.318961  7.646731  0.110136 

Checkpoint saved... 


Training done with loss: 2.307119036398298


Validation done with loss: 4.304953877625896 and mIoU: 0.08128474300256859

     epoch     train  val_loss  val_mIoU
29     30  2.378752  4.397042  0.098249
30     31  2.372264  3.964987  0.088787
31     32  2.346562  5.690737  0.106368
32     33  2.318961  7.646731  0.110136
33     34  2.307119  4.304954  0.081285 

Checkpoint saved... 


Training done with loss: 2.2939544715833575


Validation done with loss: 4.502585445705568 and mIoU: 0.10497828251457024

     epoch     train  val_loss  val_mIoU
30     31  2.372264  3.964987  0.088787
31     32  2.346562  5.690737  0.106368
32     33  2.318961  7.646731  0.110136
33     34  2.307119  4.304954  0.081285
34     35  2.293954  4.502585  0.104978 

Checkpoint saved... 


# EXPLORATION REGION