## Getting Started

Notebooks let you run code step by step as a series of "cells". As you work through these pre-made notebooks, execute each cell by selecting it and pressing *Shift+Enter*. Make sure you execute all the cells in order, because later cells rely on variables defined in earlier ones!

Try executing the two cells below to test how this works.

In [None]:
# Set a variable called "hello" to the string "hello world"
hello = 'hello world'

In [None]:
# Reference the variable from the previous cell and print it
print(hello)

## Prepare the Data

Import the necessary libraries, read in the samples, and separate them into train/test datasets.

In [None]:
import boto3
import pandas as pd
from io import StringIO

s3 = boto3.client('s3') # reuse one S3 client for every request

# Download the file with 2M pre-collected samples
samples_key = 'modules/dc759d1485734b1481cbd9beab219de2/v1/NotebookSamples.csv'
all_samples = s3.get_object(Bucket = 'ee-assets-prod-us-east-1', Key = samples_key)['Body'].read().decode('UTF-8')

# Split into lines once (keeping line endings), then slice into train/test sets
sample_lines = all_samples.splitlines(True)
train_samples = ''.join(sample_lines[:1500000]) # train on the first 1.5M samples
test_samples = ''.join(sample_lines[1500000:]) # test on the last 500K samples

# Find the bucket whose name starts with the expected prefix
buckets = s3.list_buckets()['Buckets']
samples_bucket = [bucket['Name'] for bucket in buckets if bucket['Name'].startswith('aim368-samples-bucket-')][0]

# Upload the train and test sets to that bucket
s3.put_object(Bucket = samples_bucket, Key = 'TrainSamples.csv', Body = train_samples)
s3.put_object(Bucket = samples_bucket, Key = 'TestSamples.csv', Body = test_samples)

print('Done!')
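
The split above is positional: the first 1.5M lines become the training set and the remaining lines become the test set, with no shuffling. As a minimal sketch of the same idea on a toy string (the helper `split_csv_text` and the toy data are hypothetical, not part of the workshop code):

```python
def split_csv_text(csv_text, n_train):
    """Split raw CSV text into (train, test) strings at a row boundary."""
    # Passing True (keepends) preserves each line's newline character
    lines = csv_text.splitlines(True)
    return ''.join(lines[:n_train]), ''.join(lines[n_train:])

# Four made-up rows; the first three go to train, the last one to test
toy_csv = "1,10\n2,20\n3,30\n4,40\n"
toy_train, toy_test = split_csv_text(toy_csv, 3)

print(repr(toy_train))
print(repr(toy_test))
```

Because the line endings are kept, joining each slice gives back valid CSV text that can be uploaded as-is.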

## Inspect the Data

Just to remind you, here are the features we selected:
* **Time Features**
  * duration (the value we want our model to predict)
  * FC average duration
* **Item Features**
  * dimensions
  * quantity
  * weight
* **Bin Features**
  * location
  * dimensions
  * fullness
  * clutter

Read the data into a DataFrame to make it easier to work with, label each column, and display the first 10 samples.

In [None]:
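# Show floats as whole numbers (this changes the display only, not the stored values)
pd.set_option('display.float_format', lambda x: '%i' % x)

# Every line of the file is treated as data, so supply the column labels explicitly
df = pd.read_csv(StringIO(all_samples),
                 names = ['duration', 'fc_avg', 'item_length', 'item_width', 'item_height', 'item_quantity', 'item_weight',
                          'bin_location', 'bin_width', 'bin_height', 'bin_depth', 'bin_fullness', 'bin_clutter'])
df.head(10)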
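
The cell above treats every line of the samples file as data and takes the column labels from the `names` argument. A minimal sketch of that behavior on two made-up rows (the values and the shortened column list are illustrative assumptions, not real samples):

```python
from io import StringIO
import pandas as pd

# Two made-up rows with no header line (illustrative values only)
toy_csv = "5000,4800,120\n7000,4800,300\n"

# With `names`, pandas reads every line as data and labels the columns itself
toy_df = pd.read_csv(StringIO(toy_csv), names=['duration', 'fc_avg', 'item_length'])
print(toy_df)
```

Without `names`, pandas would instead use the first line as the header, and that sample would be lost from the data.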

Make a table showing summary statistics for each feature in the dataset.

In [None]:
df.describe()
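
For numeric columns, `describe()` reports the count, mean, standard deviation, minimum, quartiles, and maximum. A small sketch on made-up values shows the rows it returns and how the `'%i'` display format set earlier affects what you see (the toy DataFrame is an assumption for illustration):

```python
import pandas as pd

# Four made-up durations (illustrative values only)
toy_df = pd.DataFrame({'duration': [1000, 2000, 3000, 4001]})
summary = toy_df.describe()

# One row per statistic
print(list(summary.index))

# The '%i' format truncates the displayed text; the stored mean is unchanged
shown_mean = '%i' % summary.loc['mean', 'duration']
print(shown_mean, summary.loc['mean', 'duration'])
```

Keep this truncation in mind when reading the table: fractional parts of the mean and standard deviation are hidden, not rounded.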

## Visualize the Data

Plot the distribution of each feature in the dataset, using a random sample of 100,000 rows for each plot.

In [None]:
df['duration'].sample(100000).plot.hist(bins = 200, title = 'Duration').set_xlabel("Duration (milliseconds)", size = 12);

In [None]:
df['fc_avg'].sample(100000).plot.hist(bins = 5, title = 'FC Average Duration').set_xlabel("Average Duration (milliseconds)", size = 12);

In [None]:
df['item_length'].sample(100000).plot.hist(bins = range(450)[::2], title = 'Product Length').set_xlabel("Length (millimeters)", size = 12);

In [None]:
df['item_width'].sample(100000).plot.hist(bins = range(400)[::2], title = 'Product Width').set_xlabel("Width (millimeters)", size = 12);

In [None]:
df['item_height'].sample(100000).plot.hist(bins = range(250)[::2], title = 'Product Height').set_xlabel("Height (millimeters)", size = 12);

In [None]:
df['item_quantity'].sample(100000).plot.hist(bins = 9, title = 'Product Quantity').set_xlabel("Count", size = 12);

In [None]:
df['item_weight'].sample(100000).plot.hist(bins = range(2500)[::25], title = 'Product Weight').set_xlabel("Weight (grams)", size = 12);

In [None]:
df['bin_location'].sample(100000).plot.hist(bins = 5, title = 'Bin Location').set_xlabel("Distance from Ground (millimeters)", size = 12);

In [None]:
df['bin_width'].sample(100000).plot.hist(bins = 5, title = 'Bin Width').set_xlabel("Width (millimeters)", size = 12);

In [None]:
df['bin_height'].sample(100000).plot.hist(bins = 5, title = 'Bin Height').set_xlabel("Height (millimeters)", size = 12);

In [None]:
df['bin_depth'].sample(100000).plot.hist(bins = 5, title = 'Bin Depth').set_xlabel("Depth (millimeters)", size = 12);

In [None]:
df['bin_fullness'].sample(100000).plot.hist(bins = range(100), title = 'Bin Fullness').set_xlabel("Fullness (percent)", size = 12);

In [None]:
df['bin_clutter'].sample(100000).plot.hist(bins = range(100), title = 'Bin Clutter').set_xlabel("Clutter (percent)", size = 12);
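
The `bins` argument appears in two forms in the plots above: an integer (such as `5`) and an explicit sequence of edges (such as `range(450)[::2]`). A minimal sketch of the difference using `numpy.histogram` on made-up values; it assumes the plotting call bins data the same way, which is how Matplotlib-backed histograms normally behave:

```python
import numpy as np

values = np.array([1, 3, 3, 7, 9, 12])  # made-up values for illustration

# An integer asks for that many equal-width bins spanning the data's min..max
counts_auto, edges_auto = np.histogram(values, bins=5)

# A sequence gives explicit bin edges; range(10)[::2] is the edges 0, 2, 4, 6, 8
counts_fixed, edges_fixed = np.histogram(values, bins=list(range(10)[::2]))

print(counts_auto, edges_auto)
print(counts_fixed, edges_fixed)
```

With explicit edges, values beyond the last edge are not counted: here 9 and 12 fall outside 0 to 8. By the same logic, a plot using `range(450)[::2]` would leave out any sampled values above 448.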