# S3 Parallel Uploads

`This notebook is a Python 2 notebook.`

There are few examples of uploading millions of small files to S3 from a local (or cloud-based) compute instance with attached block storage.

Frequently, files are compressed into a single archive, uploaded to S3, and then pulled down and unzipped at training time.

However, for image-based training data, particularly where the total dataset is larger than the memory of a single instance, the ability to add images to storage without having to download, unzip, add, rezip, and upload is essential to a startup that wants to manage its storage in a cloud object storage bucket.

I decided to run some tests using roughly 8,000,000 MNIST images in a sub-optimal folder layout (10 subfolders, one per digit from zero to nine). I ran my tests on an AWS ml.m5.24xlarge instance, which features 96 cores, 384 GB of RAM, and a 25 Gigabit network connection... basically directly into S3.

The goal is simple: upload the files as fast as possible. 

In testing, the AWS CLI `s3 sync` and `s3 cp` commands performed poorly on this kind of upload, where the file count is very high and each file is small. My first test runs went 4+ hours before I terminated them and looked for alternatives. The CLI commands were also limited in their ability to scale, use multiple cores effectively, and saturate the bandwidth of my test instance.

Before abandoning the CLI, I tried to improve its performance by [modifying the configuration to increase the number of threads & cache size](https://aws.amazon.com/premiumsupport/knowledge-center/s3-improve-transfer-sync-command/), partitioning the data and running multiple instances of the AWS CLI shell commands using [filters](https://docs.aws.amazon.com/cli/latest/reference/s3/index.html#use-of-exclude-and-include-filters) for parallel sync, and partitioning by subfolder (prefix) with parallel sync commands.
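
As a rough illustration of the partitioning approach described above, the sketch below builds one `aws s3 sync` command per subfolder, plus a variant that partitions a single source folder with `--exclude`/`--include` filters. The local path, bucket name, and subfolder names are placeholders made up for the example, not the exact commands used in these tests.

```python
from __future__ import print_function


def build_sync_commands(local_root, s3_prefix, subfolders):
    """Return one `aws s3 sync` argument list per subfolder (prefix)."""
    commands = []
    for name in subfolders:
        src = local_root.rstrip("/") + "/" + name
        dst = s3_prefix.rstrip("/") + "/" + name
        commands.append(["aws", "s3", "sync", src, dst])
    return commands


def build_filtered_sync_commands(local_root, s3_prefix, patterns):
    """Return one `aws s3 sync` per pattern: exclude everything, then
    include only the files matching that pattern."""
    return [
        ["aws", "s3", "sync", local_root, s3_prefix,
         "--exclude", "*", "--include", pattern]
        for pattern in patterns
    ]


# Hypothetical paths and subfolder names, for illustration only.
example = build_sync_commands(
    "/data/mnist8m_img", "s3://my-bucket/mnist8m", [str(d) for d in range(10)]
)
for cmd in example:
    print(" ".join(cmd))
```

Each argument list can be handed to `subprocess.Popen` to run the syncs side by side, then waited on with `wait()`. Building the commands separately from launching them makes it easy to inspect exactly what will run before starting a multi-hour transfer.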

In my research I tested a few other options, including writing my own script using Boto3.
The best performing tools I found were [rclone](https://rclone.org/) & [s3-parallel-put](https://github.com/mishudark/s3-parallel-put).
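
For reference, a hand-rolled Boto3 uploader can be sketched roughly as follows. This is a minimal sketch and not the script used in these tests: the function names, the thread count, and the chunk size are all assumptions. It relies only on the Boto3 S3 client method `upload_file(Filename, Bucket, Key)` and takes the client as an argument so the logic can be exercised without AWS credentials.

```python
from __future__ import print_function
import os
from multiprocessing.pool import ThreadPool


def iter_files(local_root):
    """Yield (local_path, relative_key) for every file under local_root."""
    for dirpath, _dirnames, filenames in os.walk(local_root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            rel = os.path.relpath(path, local_root).replace(os.sep, "/")
            yield path, rel


def upload_tree(client, local_root, bucket, prefix, threads=64):
    """Upload every file under local_root using a pool of threads.

    `client` is expected to behave like boto3.client("s3").
    Returns a list of (key, exception) pairs for uploads that failed.
    """
    def put(item):
        path, rel = item
        key = prefix.rstrip("/") + "/" + rel
        try:
            client.upload_file(path, bucket, key)
            return None
        except Exception as exc:  # collect failures instead of aborting the run
            return (key, exc)

    pool = ThreadPool(threads)
    try:
        results = pool.imap_unordered(put, iter_files(local_root), chunksize=256)
        return [r for r in results if r is not None]
    finally:
        pool.close()
        pool.join()
```

Usage would look something like `upload_tree(boto3.client("s3"), "/data/mnist8m_img", "my-bucket", "mnist8m")`, with the path and bucket being placeholders. Threads are used here because each upload mostly waits on the network; a single Python process can still become CPU-bound at this file count, which is one reason a multi-process tool may do better.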

Here is how I recommend using rclone and s3-parallel-put.

### Option 1: Use [rclone](https://rclone.org/)

In [2]:
# install
!curl https://rclone.org/install.sh | sudo bash

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  4437  100  4437    0     0  13865      0 --:--:-- --:--:-- --:--:-- 13822
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    15  100    15    0     0     47      0 --:--:-- --:--:-- --:--:--    47

The latest version of rclone rclone v1.51.0 is already installed.



In [8]:
# Make an rclone config for s3. (Can also be done for GCP. Run `rclone config create` manually in terminal)
!rclone config create s3 s3 provider AWS env_auth true region us-east-1 acl private

2020/04/23 19:07:42 NOTICE: Config file "/home/ec2-user/.config/rclone/rclone.conf" not found - using defaults
Remote config
--------------------
[s3]
type = s3
provider = AWS
env_auth = true
region = us-east-1
acl = private
--------------------


In [None]:
# https://rclone.org/docs/
# https://rclone.org/s3/#amazon-s3

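# Monitor with htop (CPU), iftop (network), and `tail -f` on the log file.
# Run rclone without TPS rate limiting, but with a very high --low-level-retries (5000) to absorb S3 errors.
# Limit whole-run retries with --retries (set to 2 in the command below).
# Transfers = 10x the number of cores, i.e. 10 threads per core (96 cores -> --transfers=960).
# Max backlog = at least the number of files, if you have enough RAM to handle it (82,000,000 is used below).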
!rclone copy /home/ec2-user/SageMaker/amazon-sagemaker-examples/sagemaker-python-sdk/pytorch_horovod_mnist/mnist8m_img s3:sagemaker-scratch-1234o2ijwoer23423/sagemaker/pytorch-mnist8m_rclone_copyftn --stats-one-line-date --transfers=960 --max-backlog=82000000 --low-level-retries=5000 --fast-list --local-no-check-updated --retries=2 --log-file=rclone_log_file_copy --stats-log-level=NOTICE

# Rclone may log some errors if it is overwhelming S3 (> 3500 PUTs/s on a single prefix).
# I recommend using `!tail -f /logfile` to observe the log output in real time.
# I also used iftop & htop to monitor network and CPU utilization.


In [3]:
# Examining the Log Head
!head -2 /home/ec2-user/SageMaker/amazon-sagemaker-examples/sagemaker-python-sdk/pytorch_horovod_mnist/rclone_log_file_copy


2020/04/23 23:20:28 NOTICE: 2020/04/23 23:20:28 -    56.277M / 2.493 GBytes, 2%, 1.069 MBytes/s, ETA 38m54s (xfr#167438/8109993)



In [4]:
# Examining the Log Tail
!tail -2 /home/ec2-user/SageMaker/amazon-sagemaker-examples/sagemaker-python-sdk/pytorch_horovod_mnist/rclone_log_file_copy


2020/04/24 00:13:28 NOTICE: 2020/04/24 00:13:28 -     2.615G / 2.615 GBytes, 100%, 932.323 kBytes/s, ETA 0s



In [5]:
from datetime import timedelta
from datetime import datetime
seconds = (datetime.strptime('2020/04/24 00:13:28', '%Y/%m/%d %H:%M:%S') - datetime.strptime('2020/04/23 23:20:28', '%Y/%m/%d %H:%M:%S')).seconds
print ("It took " + str(seconds) + " seconds." )
print ("Or you could say it took " + str(seconds/60) + " minutes." )


It took 3180 seconds.
Or you could say it took 53 minutes.
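
The timestamps above were copied by hand from the log. A small sketch that reads them from the log lines themselves is shown below; it also estimates files per second from the `xfr#done/total` counter on the first line. The helper names are made up for this example, and the rate assumes the last line marks the end of the transfer. Note that the first logged line already shows 167,438 files transferred, so the log window is slightly shorter than the whole run.

```python
from __future__ import division, print_function
import re
from datetime import datetime

TS_FORMAT = "%Y/%m/%d %H:%M:%S"
XFR_RE = re.compile(r"xfr#(\d+)/(\d+)")


def log_timestamp(line):
    """Parse the leading 'YYYY/MM/DD HH:MM:SS' timestamp of an rclone log line."""
    return datetime.strptime(line.strip()[:19], TS_FORMAT)


def elapsed_seconds(first_line, last_line):
    """Seconds between two log lines (total_seconds() also handles multi-day runs)."""
    return (log_timestamp(last_line) - log_timestamp(first_line)).total_seconds()


def files_per_second(first_line, last_line):
    """Files moved between the two lines, assuming the last line means 'all done'."""
    match = XFR_RE.search(first_line)
    if match is None:
        raise ValueError("first line has no xfr#done/total counter")
    done_at_start, total = int(match.group(1)), int(match.group(2))
    return (total - done_at_start) / elapsed_seconds(first_line, last_line)


# The two log lines shown in the cells above.
head = ("2020/04/23 23:20:28 NOTICE: 2020/04/23 23:20:28 -    56.277M / 2.493 GBytes, "
        "2%, 1.069 MBytes/s, ETA 38m54s (xfr#167438/8109993)")
tail = ("2020/04/24 00:13:28 NOTICE: 2020/04/24 00:13:28 -     2.615G / 2.615 GBytes, "
        "100%, 932.323 kBytes/s, ETA 0s")

print(elapsed_seconds(head, tail))   # 3180.0
print(files_per_second(head, tail))
```

For these two lines the rate works out to roughly 2,500 files per second over the logged window. With files this small, files per second is arguably a more useful figure than bytes per second.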


In [6]:
speed = 2.615/seconds
print ("Average upload speed was around "+ str(speed*1000*1000) + " kBytes/second")

Average upload speed was around 822.327044025 kBytes/second


### Option 2: Use [s3-parallel-put](https://github.com/mishudark/s3-parallel-put)

The s3-parallel-put tool works pretty well, but it is written for Python 2.x.
It walks the filesystem, builds a queue, puts each file to S3, and logs the result. Simple and easy to use.

It is fast because it uses multiprocessing on the backend.
My only concern is whether it would resume gracefully if it failed partway through a large upload.

In [7]:
!pip install boto
!pip install python-magic
!sudo wget -O /usr/bin/s3-parallel-put https://raw.githubusercontent.com/mishudark/s3-parallel-put/master/s3-parallel-put
!sudo chmod +x /usr/bin/s3-parallel-put

You are using pip version 10.0.1, however version 20.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting python-magic
  Downloading https://files.pythonhosted.org/packages/42/a1/76d30c79992e3750dac6790ce16f056f870d368ba142f83f75f694d93001/python_magic-0.4.15-py2.py3-none-any.whl
Installing collected packages: python-magic
Successfully installed python-magic-0.4.15
--2020-04-30 00:34:27--  https://raw.githubusercontent.com/mishudark/s3-parallel-put/master/s3-parallel-put
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.200.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.200.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19600 (19K) [text/plain]
Saving to: ‘/usr/bin/s3-parallel-put’


2020-04

In [None]:
%%time
# https://github.com/mishudark/s3-parallel-put
# Wall time: 1hr 36min 23s
!s3-parallel-put --bucket=sagemaker-scratch-1234o2ijwoer23423 --prefix=sagemaker/pytorch-mnist8m-s3-parallel-upload --walk=filesystem --put=stupid --log-filename=s3-upload-log --insecure --processes=96 /home/ec2-user/SageMaker/amazon-sagemaker-examples/sagemaker-python-sdk/pytorch_horovod_mnist/mnist8m_img


CPU times: user 1min 45s, sys: 11.5 s, total: 1min 56s
Wall time: 1h 36min 23s


In [8]:
!tail -1 s3-upload-log

INFO:s3-parallel-put[statter-20506]:put 2674526724 bytes in 8110001 files in 5783.4 seconds (462452 bytes/s, 1402.3 files/s)
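
The summary line above already carries the totals needed to compare runs. The following sketch pulls them out with a regular expression; the regex and the function name are assumptions based only on the single log line shown here, not on a documented s3-parallel-put log format.

```python
from __future__ import division, print_function
import re

SUMMARY_RE = re.compile(r"put (\d+) bytes in (\d+) files in ([\d.]+) seconds")


def parse_summary(line):
    """Extract totals from an s3-parallel-put summary line and derive rates."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError("no s3-parallel-put summary found in line")
    total_bytes = int(match.group(1))
    n_files = int(match.group(2))
    seconds = float(match.group(3))
    return {
        "bytes": total_bytes,
        "files": n_files,
        "seconds": seconds,
        "bytes_per_second": total_bytes / seconds,
        "files_per_second": n_files / seconds,
        "bytes_per_file": total_bytes / n_files,
    }


# The summary line from the log tail shown above.
line = ("INFO:s3-parallel-put[statter-20506]:put 2674526724 bytes in 8110001 files "
        "in 5783.4 seconds (462452 bytes/s, 1402.3 files/s)")
stats = parse_summary(line)
print("%.1f files/s, about %.0f bytes per file"
      % (stats["files_per_second"], stats["bytes_per_file"]))
```

The average file comes out at roughly 330 bytes, which helps explain why the bytes-per-second figures in this notebook look so low on a 25 Gigabit link: the upload is bound by the number of requests, not by bandwidth.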


In [9]:
kBytesPerSec = 462452/1000
print ("Put speed was "+ str(kBytesPerSec) + " kBytes/Second")

Put speed was 462 kBytes/Second
