<center><font size="7"><b>Training</b></font></center>

## This notebook covers the selection and preparation of the model used for training. Several pre-trained models were downloaded to the models/pre-trained directory and tested; the markdown comments below describe only the best-performing one.

## <b>1. Import modules required for this notebook</b>

In [1]:
import os
import tarfile
from shutil import rmtree
import re

## <b>2. Get paths for directories that will be used further in the code</b>

In [2]:
# Run this cell only once: after any of the "cd" commands later in the notebook, os.getcwd() returns a different directory and cur_dir would be overwritten
cur_dir = os.getcwd()

In [3]:
main_dir = os.path.dirname(cur_dir)
work_dir = os.path.join(main_dir, "public/Birds")
scripts_dir = os.path.join(main_dir, "scripts")
models_dir = os.path.join(main_dir, "models")

## <b>3. Download the model from TensorFlow and set a variable that stores the path to the chosen model</b>

In [4]:
model_name = "faster_rcnn_resnet101_v1_1024x1024_coco17_tpu-8"

In [5]:
%cd $models_dir
pretrained_dir = os.path.join(models_dir, "pre-trained")
model_path = os.path.join(models_dir, model_name)
arch = model_name + ".tar.gz"
tf_url = "http://download.tensorflow.org/models/object_detection/tf2/20200711"
# Remove a previously extracted copy of the model, if one exists
if os.path.isdir(model_path):
    rmtree(model_path)
%cd $pretrained_dir
# Download the model archive if it is not already present, then unpack it
if arch not in os.listdir(pretrained_dir):
    # Build the URL by plain concatenation; os.path.join is meant for filesystem paths
    download_model = tf_url + "/" + arch
    !wget $download_model
# The context manager closes the archive even if extraction fails
with tarfile.open(arch, "r:gz") as tar:
    tar.extractall(models_dir)

# Rename the pre-trained checkpoint directory, because training writes a file named "checkpoint" into the model directory, and remove the saved model
os.rename(os.path.join(model_path, "checkpoint"), os.path.join(model_path, "pre-checkpoint"))
rmtree(os.path.join(model_path, "saved_model"))

/home/michal/MSc_lin/7144COMP/Coursework_2/models
/home/michal/MSc_lin/7144COMP/Coursework_2/models/pre-trained
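
The download step above relies on the `wget` shell command. As a rough alternative for environments without it, the same logic can be sketched in pure Python with the standard library. The function names `build_archive_url` and `fetch_and_extract` are hypothetical helpers introduced here for illustration and are not part of the notebook; the sketch assumes the archive unpacks into a directory named after the model, as it does in the cell above.

```python
import os
import tarfile
import urllib.request


def build_archive_url(base_url, model_name):
    """Join the base URL and the archive name with a forward slash."""
    return base_url.rstrip("/") + "/" + model_name + ".tar.gz"


def fetch_and_extract(base_url, model_name, download_dir, extract_dir):
    """Download the archive unless it is already cached, then unpack it."""
    archive_path = os.path.join(download_dir, model_name + ".tar.gz")
    if not os.path.isfile(archive_path):
        # Only reached when the archive has not been downloaded before
        urllib.request.urlretrieve(build_archive_url(base_url, model_name), archive_path)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(extract_dir)
    # Assumption: the archive contains one top-level directory named after the model
    return os.path.join(extract_dir, model_name)
```

With the variables defined earlier, a call could look like `fetch_and_extract(tf_url, model_name, pretrained_dir, models_dir)`.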


## <b>4. Edit model's hyperparameters</b>

### <b>4.1 Set hyperparameters</b>

#### All four classes are used, so the number of classes is set to 4.
#### The batch size for both training and evaluation is set to 1 because of hardware restrictions: the available GPU and CPU cannot handle larger batches. On a more capable machine this value can be increased.
#### The number of training steps is set to 25000, as this setting gave the best training results without overfitting the model.
#### A base learning rate of .0007 proved to be the best setting: training completes in reasonable time and overfitting does not occur.
#### To avoid early overfitting, the warmup learning rate is set to .00004.
#### 10% of the total steps are reserved as warmup steps to give the training a smooth start.
#### The path to the last pre-trained checkpoint is set and the checkpoint type is changed to detection.
#### The remaining changes set the paths to the label map and to the train and test TFRecord files.

In [12]:
num_classes = 4
batch_size = 1
num_steps = 25000
learning_rate_base = ".0007"
total_steps = 25000
warmup_learning_rate = ".00004"
warmup_steps = 2500
checkpoint_path = os.path.join(model_path, "pre-checkpoint", "ckpt-0")
fine_tune_checkpoint_type = "detection"
label_map_path = os.path.join(work_dir, "label_map.pbtxt")
train_input_path = os.path.join(work_dir, "train.record")
eval_input_path = os.path.join(work_dir, "test.record")
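
To make the interaction between the base learning rate, the warmup learning rate and the warmup steps more concrete, here is a minimal sketch of a cosine decay schedule with linear warmup. It is an assumption that the chosen model's config uses this kind of schedule: the field names `learning_rate_base`, `total_steps`, `warmup_learning_rate` and `warmup_steps` suggest it, but the optimizer section is not visible in this notebook. The function below is an illustration written for this note, not the library's implementation.

```python
import math


def cosine_decay_with_warmup(step, learning_rate_base, total_steps,
                             warmup_learning_rate, warmup_steps):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from warmup_learning_rate up to learning_rate_base
        slope = (learning_rate_base - warmup_learning_rate) / warmup_steps
        return warmup_learning_rate + slope * step
    # Fraction of the post-warmup phase that has been completed (0.0 to 1.0)
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * learning_rate_base * (1 + math.cos(math.pi * progress))


# Values taken from the hyperparameter cell above
for step in (0, 1250, 2500, 13750, 25000):
    lr = cosine_decay_with_warmup(step, 0.0007, 25000, 0.00004, 2500)
    print(step, lr)
```

Under this assumption the rate starts at the warmup value, reaches the base rate exactly when the warmup ends, and then falls smoothly towards zero at the last step.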

### <b>4.2 Apply changes</b>

In [13]:
%cd $model_path
with open('pipeline.config') as f:
    config = f.read()

# Apply every substitution before reopening the file for writing, so that an error cannot leave pipeline.config truncated

# Set number of classes
config = re.sub('num_classes: [0-9]+', 'num_classes: {}'.format(num_classes), config)

# Set train and eval batch size (re.sub replaces every occurrence, so both are updated)
config = re.sub('batch_size: [0-9]+', 'batch_size: {}'.format(batch_size), config)

# Set number of training steps
config = re.sub('num_steps: [0-9]+', 'num_steps: {}'.format(num_steps), config)

# Set base learning rate (the character class matches any numeric literal, e.g. .04, 0.04 or 4e-05)
config = re.sub('learning_rate_base: [-+0-9.eE]+', 'learning_rate_base: {}'.format(learning_rate_base), config)

# Set total number of steps
config = re.sub('total_steps: [0-9]+', 'total_steps: {}'.format(total_steps), config)

# Set warmup learning rate
config = re.sub('warmup_learning_rate: [-+0-9.eE]+', 'warmup_learning_rate: {}'.format(warmup_learning_rate), config)

# Set number of warmup steps
config = re.sub('warmup_steps: [0-9]+', 'warmup_steps: {}'.format(warmup_steps), config)

# Set path to the pre-trained checkpoint
config = re.sub('fine_tune_checkpoint: ".*?"', 'fine_tune_checkpoint: "{}"'.format(checkpoint_path), config)

# Set type of the checkpoint
config = re.sub('fine_tune_checkpoint_type: ".*?"', 'fine_tune_checkpoint_type: "{}"'.format(fine_tune_checkpoint_type), config)

# Set path to the label map file
config = re.sub('label_map_path: ".*?"', 'label_map_path: "{}"'.format(label_map_path), config)

# Set path to the train TFRecord file
config = re.sub('(input_path: ".*?)(PATH_TO_BE_CONFIGURED/train)(.*?")', 'input_path: "{}"'.format(train_input_path), config)

# Set path to the test TFRecord file
config = re.sub('(input_path: ".*?)(PATH_TO_BE_CONFIGURED/val)(.*?")', 'input_path: "{}"'.format(eval_input_path), config)

with open('pipeline.config', 'w') as f:
    f.write(config)

/home/michal/MSc_lin/7144COMP/Coursework_2/models/faster_rcnn_resnet101_v1_1024x1024_coco17_tpu-8
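
One weakness of plain `re.sub` is that it silently does nothing when a pattern fails to match, so a mistyped field name would go unnoticed. The sketch below shows one possible way to guard against that for the numeric fields. The helper `set_field` and the `sample_config` text are hypothetical and exist only for this illustration; the sample is a shortened stand-in, not the real pipeline.config.

```python
import re

# Hypothetical, shortened stand-in for a pipeline.config fragment
sample_config = """\
train_config {
  batch_size: 64
  num_steps: 100000
  learning_rate_base: .04
  total_steps: 100000
  warmup_learning_rate: .013333
  warmup_steps: 2000
}
"""


def set_field(text, name, value):
    """Replace the value of every `name: value` entry; fail if none is found."""
    # \b stops "learning_rate_base" from matching inside a longer field name
    pattern = r"\b{}: \S+".format(re.escape(name))
    # A function replacement avoids backslash escapes being interpreted in value
    new_text, count = re.subn(pattern, lambda m: "{}: {}".format(name, value), text)
    if count == 0:
        raise ValueError("field not found: " + name)
    return new_text


updated = set_field(sample_config, "batch_size", 1)
updated = set_field(updated, "learning_rate_base", ".0007")
print(updated)
```

Because the pattern uses `\S+`, this helper only suits values without spaces, such as numbers; quoted paths would still need the quoted patterns used in the cell above.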


### <b>4.3 Show model configuration</b>

In [14]:
%cat $model_path/pipeline.config

# Faster R-CNN with Resnet-50 (v1)
# Trained on COCO, initialized from Imagenet classification checkpoint

# This config is TPU compatible.

model {
  faster_rcnn {
    num_classes: 4
    image_resizer {
      fixed_shape_resizer {
        width: 1024
        height: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_resnet101_keras'
      batch_norm_trainable: true
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_lo

## <b>5. Initiate training</b>

In [8]:
%cd $scripts_dir
!python model_main_tf2.py --model_dir=$model_path --pipeline_config_path=$model_path/pipeline.config --num_train_steps=$num_steps --alsologtostderr

/home/michal/MSc_lin/7144COMP/Coursework_2/scripts
2020-12-08 19:31:47.958292: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-12-08 19:31:49.459396: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-12-08 19:31:49.484239: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-08 19:31:49.484590: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:1c:00.0 name: GeForce RTX 2080 computeCapability: 7.5
coreClock: 1.86GHz coreCount: 46 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.23GiB/s
2020-12-08 19:31:49.484622: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-12-08 19:3