# Once your Ground Truth job has completed, let's prepare for training

This notebook walks you through the steps we have taken to process the object detection label output from Ground Truth to prepare it for model training in SageMaker. 

1. [Join together outputs from multiple labeling jobs](#join_output)
1. [Filter out labels that did not meet our quality bar](#filter_bad_labels)
1. [Generate TFRecords for import into TensorFlow](#generate_records)
1. [Upload Records to S3](#upload_records)
1. [Data augmentation (optional)](#data_aug)

## Setup

In [1]:
BUCKET = 'cvml-sagemaker-repo'
JOB_NAME = 'wakeboarder-detection' 

### Import dependencies and define helper functions

In [2]:
import numpy as np
import random
import os, shutil
import json
import boto3
import botocore
import sagemaker

In [3]:
sagemaker_client = boto3.client('sagemaker')

def make_tmp_folder(folder_name):
    try:
        os.makedirs(folder_name, exist_ok=False)
    except FileExistsError:
        print("{} folder already exists".format(folder_name))
        
def read_manifest_file(file_path):
    with open(file_path, 'r') as f:
        output = [json.loads(line.strip()) for line in f.readlines()]
        return output

### Specify the Ground Truth labeling job ID(s) from the jobs section in the SageMaker console

In [4]:
## if using your own Ground Truth labeling job, replace below with appropriate job IDs
LABEL_JOB_IDS = ['wakeboarder-detection']


In [5]:
TMP_FOLDER_NAME = 'tmp' 
make_tmp_folder(TMP_FOLDER_NAME)


tmp folder already exists


## 1. Join outputs from multiple jobs <a href id='join_output'></a>

So that you can iterate on Ground Truth jobs, this step can join the outputs of several jobs. You can still run this cell if you have only one job.

The code below takes one or more Ground Truth job IDs, downloads the output of each (in Augmented Manifest File format), and joins the lines into a single list for further manipulation.

In [6]:
joined_outputs = []

def get_output_manifest_s3_uri(label_job_id):
    # Look up the S3 URI of the output manifest for the given Ground Truth labeling job
    return sagemaker_client.describe_labeling_job(LabelingJobName=label_job_id)['LabelingJobOutput']['OutputDatasetS3Uri']

for label_job_id in LABEL_JOB_IDS: 
    output_manifest_s3_uri = get_output_manifest_s3_uri(label_job_id)
    output_manifest_fname = "{}-{}".format(label_job_id, os.path.split(output_manifest_s3_uri)[1])
    !aws s3 cp $output_manifest_s3_uri $TMP_FOLDER_NAME/$output_manifest_fname
    output_manifest_local_path = os.path.join(TMP_FOLDER_NAME, output_manifest_fname)
    output_manifest_lines = read_manifest_file(output_manifest_local_path)
    print("loaded {} lines from {}".format(len(output_manifest_lines), output_manifest_local_path))
    joined_outputs += output_manifest_lines
    
print("loaded total of {} lines".format(len(joined_outputs)))

download: s3://cvml-sagemaker-repo/ground_truth/output/wakeboarder-detection/manifests/output/output.manifest to tmp/wakeboarder-detection-output.manifest
loaded 298 lines from tmp/wakeboarder-detection-output.manifest
loaded total of 298 lines
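
The cell above shells out to `aws s3 cp`. As a rough alternative, the same manifest could be fetched with boto3 once the S3 URI is split into a bucket and a key. The helper name `split_s3_uri` is hypothetical and not part of this repository; the download call is left as a comment because it needs AWS credentials.

```python
from urllib.parse import urlparse

def split_s3_uri(s3_uri):
    """Split 's3://bucket/some/key' into (bucket, key). Hypothetical helper."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {}".format(s3_uri))
    return parsed.netloc, parsed.path.lstrip("/")

# URI taken from the download log above
bucket, key = split_s3_uri(
    "s3://cvml-sagemaker-repo/ground_truth/output/wakeboarder-detection/manifests/output/output.manifest"
)
print(bucket)
print(key)

# With credentials available, the file could then be fetched with:
# boto3.resource('s3').Bucket(bucket).download_file(key, output_manifest_local_path)
```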


## View Example labels

In [7]:
joined_outputs[15]

{'source-ref': 's3://cvml-sagemaker-repo/frames/wakeboarding_001055.jpg',
 'wakeboarder-detection': {'annotations': [{'class_id': 0,
    'width': 483,
    'top': 125,
    'height': 544,
    'left': 306}],
  'image_size': [{'width': 1280, 'depth': 3, 'height': 720}]},
 'wakeboarder-detection-metadata': {'job-name': 'labeling-job/wakeboarder-detection',
  'class-map': {'0': 'wakeboard'},
  'human-annotated': 'yes',
  'objects': [{'confidence': 0.09}],
  'creation-date': '2020-03-12T20:02:31.826196',
  'type': 'groundtruth/object-detection'}}

In [8]:
joined_outputs[-15]

{'source-ref': 's3://cvml-sagemaker-repo/frames/wakeboarding_018743.jpg',
 'wakeboarder-detection': {'annotations': [],
  'image_size': [{'width': 1280, 'depth': 3, 'height': 720}]},
 'wakeboarder-detection-metadata': {'job-name': 'labeling-job/wakeboarder-detection',
  'class-map': {},
  'human-annotated': 'yes',
  'objects': [],
  'creation-date': '2020-03-12T20:45:58.452830',
  'type': 'groundtruth/object-detection'}}
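
Ground Truth stores each box in pixels as `left`, `top`, `width`, and `height`. Object detection pipelines often work with corner coordinates normalized to the range 0 to 1 instead. The sketch below shows that conversion for the first example record; the function name `to_normalized_box` is hypothetical, and whether `tf_record_util.py` performs exactly this step is an assumption, not something shown in this notebook.

```python
def to_normalized_box(annotation, image_size):
    """Convert a pixel box (left/top/width/height) to normalized corners."""
    img_w = float(image_size["width"])
    img_h = float(image_size["height"])
    return {
        "xmin": annotation["left"] / img_w,
        "ymin": annotation["top"] / img_h,
        "xmax": (annotation["left"] + annotation["width"]) / img_w,
        "ymax": (annotation["top"] + annotation["height"]) / img_h,
    }

# Values copied from joined_outputs[15] above
annotation = {"class_id": 0, "width": 483, "top": 125, "height": 544, "left": 306}
image_size = {"width": 1280, "depth": 3, "height": 720}

box = to_normalized_box(annotation, image_size)
print(box)
```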

## 2. MANUAL REVIEW: Discard any bad labels from visual inspection <a href id="filter_bad_labels"></a>

In the __SageMaker console__, under labeling jobs, select your completed labeling job and review the labeled images. To remove any image that does not meet the quality bar, add its file name (without the .jpg extension) to the set below:

In [9]:
TO_DISCARD = set([
    'wakeboarding_003299',
    'wakeboarding_007127',
    'wakeboarding_007391',
    'wakeboarding_008183',
    'wakeboarding_008777',
    'wakeboarding_013331',
    'wakeboarding_013397',
    'wakeboarding_013925',
    'wakeboarding_015905',
    'wakeboarding_015971',
    'wakeboarding_017555',
    'wakeboarding_017753',
    'wakeboarding_017819',
    'wakeboarding_018149',
    'wakeboarding_018611',
    'wakeboarding_018677',
    'wakeboarding_018743',
    'wakeboarding_019535',
])

In [20]:
filtered_manifest = []
recover_list = []
count_filtered = 0
for line in joined_outputs:
    filename = os.path.split(line["source-ref"])[1]
    imageid = os.path.splitext(filename)[0]
    if imageid not in TO_DISCARD:
        filtered_manifest.append(line)
        recover_list.append(filename)
    else:
        count_filtered += 1
print("filtered out {} labels. {} labels remain".format(count_filtered, len(filtered_manifest)))


filtered out 18 labels. 280 labels remain
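
Visual review remains the deciding step, but a quick pass over the manifest can suggest candidates to look at first. The sketch below lists image IDs whose annotation list is empty, as in the second example record shown earlier. The function name `images_without_boxes` is hypothetical, and an empty list is not necessarily a bad label: a frame may simply contain no object.

```python
import os

def images_without_boxes(manifest_lines, label_attribute):
    """Return image IDs (file name without extension) that have no annotations."""
    ids = []
    for line in manifest_lines:
        if not line[label_attribute]["annotations"]:
            filename = os.path.split(line["source-ref"])[1]
            ids.append(os.path.splitext(filename)[0])
    return ids

# Two trimmed records modeled on the examples shown earlier
sample_lines = [
    {"source-ref": "s3://cvml-sagemaker-repo/frames/wakeboarding_001055.jpg",
     "wakeboarder-detection": {"annotations": [
         {"class_id": 0, "width": 483, "top": 125, "height": 544, "left": 306}]}},
    {"source-ref": "s3://cvml-sagemaker-repo/frames/wakeboarding_018743.jpg",
     "wakeboarder-detection": {"annotations": []}},
]

candidates = images_without_boxes(sample_lines, "wakeboarder-detection")
print(candidates)
```

In the notebook itself, the same call could be made with `joined_outputs` in place of `sample_lines`.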


In [21]:
s3 = boto3.resource('s3')

# Download each retained frame from the bucket into the current working directory
for image in recover_list:
    s3.Bucket(BUCKET).download_file('frames/{}'.format(image), image)

## 3. Generate TFRecords for TensorFlow <a href id='generate_records'></a>

Recall the folder where the images were extracted from the video in the previous notebook, and set the full __IMAGE_PATH__ in the cell that follows the listing below:

In [22]:
!ls -d /home/ec2-user/SageMaker/amazon-sagemaker-aws-greengrass-custom-object-detection-model/data-prep/tmp/*/

/home/ec2-user/SageMaker/amazon-sagemaker-aws-greengrass-custom-object-detection-model/data-prep/tmp/wakeboarding/


In [32]:
#Create a reference to the images stored locally
IMAGE_PATH = '/home/ec2-user/SageMaker/amazon-sagemaker-aws-greengrass-custom-object-detection-model/data-prep/tmp/wakeboarding'

# Write the filtered manifest (built in memory above) to disk, one JSON object per line
SCRUBBED_MANIFEST = './{}/scrubbed.manifest'.format(TMP_FOLDER_NAME)

with open(SCRUBBED_MANIFEST, 'w') as f:
    for item in filtered_manifest:
        f.write(json.dumps(item))
        f.write('\n')
print(SCRUBBED_MANIFEST)       

./tmp/scrubbed.manifest


In [33]:
# Create a local directory to hold the TFRecord output (-p avoids an error on re-run)
!mkdir -p $TMP_FOLDER_NAME/tf_records
OUTPUT_DIR = './{}/tf_records'.format(TMP_FOLDER_NAME)

# Map each class ID to its human-readable name
LABEL_MAP = {'0': 'wakeboard'}
print(OUTPUT_DIR)

./tmp/tf_records


#### Please read (TODO):
The code below uses [tf_record_util.py](./utils/tf_record_util.py), which handles the binary assembly of the records and the split of the dataset into TFRecord files. That file has a hardcoded class value that needs to be reworked to be more dynamic. Note the reference to __annotation_dict['wakeboarder-detection']['annotations']__, which must be changed to match your class annotation if it differs. You can view the contents of the file in the appendix below, if you like. [View file here](#appendix)
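
One possible way to avoid the hardcoded key is to read it from the manifest itself. In the example records shown earlier, each line has three keys: `source-ref`, the label attribute, and the same attribute with a `-metadata` suffix. The sketch below relies on that observation; `find_label_attribute` is a hypothetical helper, not part of `tf_record_util.py`, and a manifest that carries more than one label attribute would need an explicit choice instead.

```python
def find_label_attribute(record):
    """Guess the label attribute name of one manifest record.

    Assumes exactly one key besides 'source-ref' and '*-metadata'.
    """
    candidates = [
        key for key in record
        if key != "source-ref" and not key.endswith("-metadata")
    ]
    if len(candidates) != 1:
        raise ValueError("expected one label attribute, found {}".format(candidates))
    return candidates[0]

# Trimmed record modeled on the examples shown earlier
record = {
    "source-ref": "s3://cvml-sagemaker-repo/frames/wakeboarding_001055.jpg",
    "wakeboarder-detection": {"annotations": []},
    "wakeboarder-detection-metadata": {"type": "groundtruth/object-detection"},
}

label_attribute = find_label_attribute(record)
print(label_attribute)
```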

In [34]:
from utils.tf_record_util import TfRecordGenerator

# Feed in necessary path variables from above operations
tf_record_generator = TfRecordGenerator(image_dir=IMAGE_PATH,
                                        manifest=SCRUBBED_MANIFEST,
                                        label_map=LABEL_MAP,
                                        output_dir=OUTPUT_DIR)

print('GENERATING TF RECORD FILES')
tf_record_generator.generate_tf_records()

print('GENERATING LABEL MAP FILE')
with open(f'{OUTPUT_DIR}/label_map.pbtxt', 'w') as label_map_file:
    for item in LABEL_MAP:
        label_map_file.write('item {\n')
        label_map_file.write(' id: ' + str(int(item) + 1) + '\n')
        label_map_file.write(" name: '" + LABEL_MAP[item] + "'\n")
        label_map_file.write('}\n\n')
        
print('FINISHED')

GENERATING TF RECORD FILES
GENERATING LABEL MAP FILE
FINISHED
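
The label map written above shifts each class ID by one, so class `'0'` becomes `id: 1`. The sketch below builds the same text as a string, which makes it easy to inspect before writing it to disk. The function name `build_label_map_pbtxt` is hypothetical, and the sort by numeric ID is an addition that only matters when there is more than one class.

```python
def build_label_map_pbtxt(label_map):
    """Render a {class_id: name} dict in the same text layout as the cell above."""
    entries = []
    # Sort by numeric class ID so the output order is deterministic
    for class_id, name in sorted(label_map.items(), key=lambda kv: int(kv[0])):
        entries.append(
            "item {{\n id: {}\n name: '{}'\n}}\n\n".format(int(class_id) + 1, name)
        )
    return "".join(entries)

label_map_text = build_label_map_pbtxt({'0': 'wakeboard'})
print(label_map_text)
```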


## 4. Upload TFRecords and label map to S3 <a href id='upload_records'></a>

In [37]:
s3.Bucket(BUCKET).upload_file(f'{OUTPUT_DIR}/label_map.pbtxt', 'tfrecords/label_map.pbtxt')
s3.Bucket(BUCKET).upload_file(f'{OUTPUT_DIR}/train.records', 'tfrecords/train.records')
s3.Bucket(BUCKET).upload_file(f'{OUTPUT_DIR}/validation.records', 'tfrecords/validation.records')

## 5. SKIP: Optional Data augmentation <a href id='data_aug'></a> (NOT YET ADAPTED TO TF)

In [None]:
%%time
%run ./flip_images.py -m s3://$BUCKET/training-manifest/$JOB_NAME/train.manifest -d $TMP_FOLDER_NAME -b $BUCKET

In [None]:
%run ./flip_annotations.py -m s3://$BUCKET/training-manifest/$JOB_NAME/train.manifest -d $TMP_FOLDER_NAME -p $JOB_NAME
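
The contents of `flip_images.py` and `flip_annotations.py` are not shown in this notebook. Purely as an illustration of the geometry a horizontal flip involves, the sketch below mirrors a Ground Truth style box across the vertical center line of the image: only `left` changes, to `image_width - left - width`. The function name `flip_box_horizontally` is hypothetical.

```python
def flip_box_horizontally(annotation, image_width):
    """Mirror a pixel box (left/top/width/height) across the image's vertical axis."""
    flipped = dict(annotation)  # copy so the input annotation is left untouched
    flipped["left"] = image_width - annotation["left"] - annotation["width"]
    return flipped

# Box and image width copied from the first example record in this notebook
annotation = {"class_id": 0, "width": 483, "top": 125, "height": 544, "left": 306}
flipped = flip_box_horizontally(annotation, image_width=1280)
print(flipped)
```

Flipping twice returns the original box, which is a handy sanity check when adapting this step.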

# Next step

Now we are ready to start training! Move on to the [next notebook](./03_Sagemaker_Training_TF.ipynb) to submit a SageMaker training job that trains our custom object detection model.

## Informational Appendix: <a href id="appendix"></a>

The cell below displays the contents of the to-be-reworked TFRecord converter. It works, but requires a manual update to account for the desired class name (and yes, it supports only one class).

In [39]:
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter
import IPython

with open('./utils/tf_record_util.py') as f:
    code = f.read()

formatter = HtmlFormatter()
IPython.display.HTML('<style type="text/css">{}</style>{}'.format(
    formatter.get_style_defs('.highlight'),
    highlight(code, PythonLexer(), formatter)))