# Streaming Training Data From S3

So far in this book we have used Amazon S3 to store our training datasets. By default, SageMaker downloads the full dataset to each training node (the `FullyReplicated` distribution strategy), which can be problematic for large DL datasets. In the previous example, `Chapter04/2_Distributed_Data_Processing.ipynb`, we also learned how to use the `ShardedByS3Key` distribution strategy, which reduces the amount of data downloaded to each training node. However, the data still has to be downloaded before training can start, so for large datasets (hundreds of gigabytes or more) this approach solves the problem only partially.
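
To make the difference between the two distribution strategies concrete (in the SageMaker API they are the `S3DataDistributionType` values `FullyReplicated` and `ShardedByS3Key`), here is a minimal simulation in plain Python. The function `assign_objects()` and the object keys are hypothetical, and the round-robin split is an assumption made for illustration only: SageMaker decides the actual assignment of objects to instances.

```python
from typing import Dict, List


def assign_objects(keys: List[str], instance_count: int, distribution: str) -> Dict[int, List[str]]:
    """Return the S3 keys each training instance would download.

    The round-robin split for ShardedByS3Key is an illustrative assumption,
    not the exact algorithm used by SageMaker.
    """
    if distribution == "FullyReplicated":
        # Every instance receives a full copy of the dataset.
        return {i: list(keys) for i in range(instance_count)}
    if distribution == "ShardedByS3Key":
        # Every instance receives a disjoint subset of the objects.
        return {i: keys[i::instance_count] for i in range(instance_count)}
    raise ValueError(f"Unknown distribution: {distribution}")


# Hypothetical object keys: eight files of roughly equal size.
keys = [f"train/part-{n:03d}.tfrecords" for n in range(8)]

replicated = assign_objects(keys, instance_count=4, distribution="FullyReplicated")
sharded = assign_objects(keys, instance_count=4, distribution="ShardedByS3Key")

print(len(replicated[0]))  # instance 0 downloads all 8 objects
print(sharded[0])          # instance 0 downloads only its 2 objects
```

In both cases each instance still waits for its download to finish before the first training step, which is the cost that streaming removes.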

An alternative approach to reducing training time is to stream data from Amazon S3 without downloading it upfront. Amazon SageMaker provides several implementations of S3 data streaming:
- Framework-specific streaming implementations: `PipeModeDataset` for TensorFlow and the `S3 Plugin` for PyTorch
- Framework-agnostic `FastFile` mode

In this example we will learn how to use the `PipeModeDataset` streaming feature to train a TensorFlow model. For this we will convert the CIFAR-100 dataset into `TFRecords` format and then stream it at training time using `PipeModeDataset` from the SageMaker TensorFlow extensions library ([link](https://github.com/aws/sagemaker-tensorflow-extensions)).

### Prerequisites 
To run this example you need to have the `tensorflow` and `wget` packages installed. Feel free to run the cell below to install the required dependencies.


In [None]:
! pip install -r requirements.txt

## Converting Data to TFRecords

`PipeModeDataset` is an open-source implementation of the TensorFlow Dataset API that reads SageMaker Pipe Mode channels. `PipeModeDataset` supports several dataset formats, such as text line, RecordIO, and TFRecord. We chose the TFRecord format for this example.
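
Record-oriented formats such as TFRecord suit streaming because every record carries its own length, so a reader can consume a channel strictly front to back without seeking. The sketch below illustrates this in plain Python, without TensorFlow. It assumes the layout given in the TensorFlow documentation (an 8-byte little-endian length, a masked CRC-32C of the length, the payload, and a masked CRC-32C of the payload). The helpers `write_record()` and `read_records()` are hypothetical names, and the payloads are placeholder bytes rather than serialized `Example` objects.

```python
import io
import struct

_MASK_DELTA = 0xA282EAD8


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli); slow, but sufficient for a demonstration."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def masked_crc(data: bytes) -> int:
    """Rotate the checksum right by 15 bits, then add a constant."""
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + _MASK_DELTA) & 0xFFFFFFFF


def write_record(stream, payload: bytes) -> None:
    """Append one length-prefixed, checksummed record to the stream."""
    header = struct.pack("<Q", len(payload))
    stream.write(header)
    stream.write(struct.pack("<I", masked_crc(header)))
    stream.write(payload)
    stream.write(struct.pack("<I", masked_crc(payload)))


def read_records(stream):
    """Yield payloads one by one using forward reads only, as a pipe requires."""
    while True:
        header = stream.read(8)
        if not header:
            return
        (length,) = struct.unpack("<Q", header)
        (length_crc,) = struct.unpack("<I", stream.read(4))
        if length_crc != masked_crc(header):
            raise ValueError("corrupted record length")
        payload = stream.read(length)
        (payload_crc,) = struct.unpack("<I", stream.read(4))
        if payload_crc != masked_crc(payload):
            raise ValueError("corrupted record payload")
        yield payload


# Placeholder payloads; a real file would hold serialized Example objects.
buffer = io.BytesIO()
for sample in (b"first example", b"second example"):
    write_record(buffer, sample)

buffer.seek(0)  # rewind so the reader starts at the first record
records = list(read_records(buffer))
print(records)
```

Each record adds 16 bytes of framing to its payload, and the reader never needs to know the total file size in advance.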

We start by converting the original dataset into TFRecord format. For this we prepared a conversion script, `3_sources/generate_cifar100_records.py`. Here are several highlights from the script:
- Method `download_and_extract()` (line #31) downloads and unarchives the CIFAR-100 dataset.
- Method `convert_to_tfrecord()` (line #62) iterates over the dataset, converts each pair of image and label into a TensorFlow `Example` object, and writes a batch of `Example` objects into a single TFRecord file.
- Methods `_int64_feature()` (line #38) and `_bytes_feature()` (line #42) convert images and class labels to the expected types.

Execute the cell below to review the conversion script.

In [None]:
conversion_script = "3_sources/generate_cifar100_records.py"
data_dir = "cifar100_data"

! pygmentize -O linenos=1  $conversion_script

Run the cell below to perform the conversion.

In [None]:
! python  $conversion_script --data-dir $data_dir

Once the dataset is converted to TFRecord format, we upload it to Amazon S3 using the SageMaker SDK `S3Uploader` class.

In [None]:
import sagemaker
from sagemaker.s3 import S3Uploader
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()
role = get_execution_role()

bucket = sagemaker_session.default_bucket()
dataset_uri = S3Uploader.upload(data_dir, "s3://{}/tf-cifar10-example/data".format(bucket))

print(dataset_uri)

## Developing Training Script

Next, we need to prepare the training script. Execute the cell below to preview it. Our training script largely follows a typical TensorFlow training script, with several differences:
- We import the `PipeModeDataset` class from `sagemaker_tensorflow` (line #22) and use it as the input dataset for training.
- Method `_dataset_parser()` (line #143) implements the parsing logic for the dataset.
- Method `_input()` returns parsed data samples and classes and is used to instantiate the training and evaluation datasets (lines #184-185).

The rest of the training script is similar to any other Keras application.

In [None]:
! pygmentize -O linenos=1  3_sources/train.py

## Running Training Job

Now we are ready to run our training job, which will stream data from S3. When configuring the SageMaker training job, we need to explicitly specify that we want to use Pipe mode. Before that, we define the hyperparameters and metrics.

In [13]:
hyperparameters = {
    "batch-size": 256,
    "epochs": 10,
}


metric_definitions = [
    {"Name": "train:loss", "Regex": ".*loss: ([0-9\\.]+) - accuracy: [0-9\\.]+.*"},
    {"Name": "train:accuracy", "Regex": ".*loss: [0-9\\.]+ - accuracy: ([0-9\\.]+).*"},
    {
        "Name": "validation:accuracy",
        "Regex": ".*step - loss: [0-9\\.]+ - accuracy: [0-9\\.]+ - val_loss: [0-9\\.]+ - val_accuracy: ([0-9\\.]+).*",
    },
    {
        "Name": "validation:loss",
        "Regex": ".*step - loss: [0-9\\.]+ - accuracy: [0-9\\.]+ - val_loss: ([0-9\\.]+) - val_accuracy: [0-9\\.]+.*",
    },
    {
        "Name": "sec/steps",
        "Regex": ".* - \\d+s (\\d+)[mu]s/step - loss: [0-9\\.]+ - accuracy: [0-9\\.]+ - val_loss: [0-9\\.]+ - val_accuracy: [0-9\\.]+",
    },
]
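
SageMaker applies each `Regex` to the training job's log output and records the first capture group as the metric value, so it is worth checking the patterns locally before starting a job. The sketch below does this for three of the patterns using Python's `re` module. The log line is a hypothetical example written in the style Keras prints at the end of an epoch, and the patterns are written as raw strings, which yield the same regular expressions as the escaped versions above.

```python
import re

# Hypothetical log line in the style Keras prints at the end of an epoch.
log_line = (
    "196/196 [==============================] - 15s 77ms/step - loss: 3.9012 "
    "- accuracy: 0.1023 - val_loss: 3.6011 - val_accuracy: 0.1502"
)

patterns = {
    "train:loss": r".*loss: ([0-9\.]+) - accuracy: [0-9\.]+.*",
    "validation:accuracy": (
        r".*step - loss: [0-9\.]+ - accuracy: [0-9\.]+ "
        r"- val_loss: [0-9\.]+ - val_accuracy: ([0-9\.]+).*"
    ),
    "sec/steps": (
        r".* - \d+s (\d+)[mu]s/step - loss: [0-9\.]+ - accuracy: [0-9\.]+ "
        r"- val_loss: [0-9\.]+ - val_accuracy: [0-9\.]+"
    ),
}

# Collect the first capture group of every pattern, or None when it does not match.
extracted = {}
for name, pattern in patterns.items():
    match = re.search(pattern, log_line)
    extracted[name] = match.group(1) if match else None

print(extracted)
```

A `None` value in the result means the pattern would silently produce no metric for that log format, which is easier to catch here than after a training job has finished.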


Run the cell below to start the training job. Note that we set `input_mode="Pipe"` as part of the estimator configuration.

In [None]:
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="train.py", 
    source_dir="3_sources", 
    metric_definitions=metric_definitions, 
    hyperparameters=hyperparameters, 
    role=role, 
    framework_version="1.15.2", 
    py_version="py3", 
    train_instance_count=1, 
    input_mode="Pipe", 
    train_instance_type="ml.p2.xlarge", 
    base_job_name="cifar100-tf", 
)

inputs = {
    "train": "{}/train".format(dataset_uri),
    "validation": "{}/validation".format(dataset_uri),
    "eval": "{}/eval".format(dataset_uri),
}

estimator.fit(inputs)
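
The keys of the `inputs` dictionary become channel names inside the training container. As far as we understand the SageMaker documentation, in Pipe mode each channel is exposed as a named pipe at `/opt/ml/input/data/<channel>_<epoch>`, with a new pipe for every epoch. The hypothetical helper `pipe_path()` below only spells out that naming rule.

```python
def pipe_path(channel: str, epoch: int, base_dir: str = "/opt/ml/input/data") -> str:
    """Return the named-pipe path for a channel in a given epoch.

    The naming rule is taken from the SageMaker documentation as we understand
    it; the helper itself is illustrative and not part of any SDK.
    """
    return f"{base_dir}/{channel}_{epoch}"


# Channel names match the keys of the inputs dictionary passed to fit().
channels = ["train", "validation", "eval"]
first_epoch_pipes = [pipe_path(channel, epoch=0) for channel in channels]

for path in first_epoch_pipes:
    print(path)
```

`PipeModeDataset` hides these paths, so a training script normally refers to a channel by name only.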

You can now observe the training job's performance in the AWS Console. You may notice that the training job started faster because we avoided the initial data download. Note that since the CIFAR-100 dataset is relatively small, you may not see a considerable decrease in training start time. However, with bigger datasets such as COCO2017 you can expect training time to be reduced by at least several minutes.