# Using the SageMaker SDK with built-in algorithms


***Being familiar with the SageMaker SDK is key to making the most of SageMaker.
You can find its documentation at https://sagemaker.readthedocs.io.
Walking through a simple example is the best way to get started. In this section, we'll use
the Linear Learner algorithm to train a regression model on the Boston Housing dataset.
We'll proceed very slowly, leaving no stone unturned. Once again, these concepts are
essential.***

## Preparing data

***Built-in algorithms expect the dataset to be in a certain format, such as CSV, protobuf, or libsvm. Supported formats are listed in the algorithm documentation. For instance, Linear Learner supports CSV and recordIO-wrapped protobuf (https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner.html#ll-input_output). Our input dataset is already in the repository in CSV format, so let's use that. Dataset preparation will be extremely simple, and we'll run it manually:***

In [7]:
!unzip archive.zip

Archive:  archive.zip
  inflating: Boston.csv              


In [9]:
# 1. Using pandas, we load the CSV file extracted from the archive:

import pandas as pd
dataset = pd.read_csv('Boston.csv')

In [10]:
# 2. Then, we print the shape of the dataset:

print(dataset.shape)


(506, 15)


In [11]:
# 3. Now, we display the first 5 lines of the dataset:

dataset[:5]

# This displays the table shown below. Besides a leftover row-number column
# (Unnamed: 0), each house has 13 features and a target attribute (medv) set to
# the median value of the house in thousands of dollars:

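| | Unnamed: 0 | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 2 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 3 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 4 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 5 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 |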


In [12]:
# 4. Reading the algorithm documentation
# (https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html), we see that
# Amazon SageMaker requires that a CSV file doesn't have a header record and that
# the target variable is in the first column. Accordingly, we move the medv column
# to the front of the dataframe:
dataset = pd.concat([dataset['medv'],dataset.drop(['medv'], axis=1)],axis=1)


In [13]:
# 5. A bit of scikit-learn magic helps split the dataframe up into two parts: 90% for training, and 10% for validation:
from sklearn.model_selection import train_test_split
training_dataset, validation_dataset = train_test_split(dataset, test_size=0.1)
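
To make the split sizes concrete, here is a small standard-library sketch of the arithmetic behind a 90/10 split. It assumes that a fractional test size is rounded up, which is how scikit-learn handles `test_size=0.1`; the helper names `split_sizes` and `shuffle_split` are illustrative and not part of the notebook.

```python
import math
import random

def split_sizes(n_samples, test_size=0.1):
    """Return (n_train, n_validation) for a fractional test_size."""
    # Assumption: the validation share is rounded up to a whole sample.
    n_validation = math.ceil(n_samples * test_size)
    return n_samples - n_validation, n_validation

def shuffle_split(rows, test_size=0.1, seed=42):
    """Shuffle rows with a fixed seed, then cut them into two parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train, _ = split_sizes(len(rows), test_size)
    return rows[:n_train], rows[n_train:]

n_train, n_validation = split_sizes(506)
print(n_train, n_validation)  # 455 51
```

Because the split is random, the two files will differ from run to run; passing `random_state` to `train_test_split` is the usual way to make it reproducible.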


In [14]:
# 6. We save these two splits to individual CSV files, without either an index or a header:
training_dataset.to_csv('training_dataset.csv',index=False, header=False)
validation_dataset.to_csv('validation_dataset.csv',index=False, header=False)
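
Before uploading, it can be worth checking that a file really has no header row. The sketch below is one possible sanity check using only the standard library: it treats a first row whose fields all parse as numbers as a data row. The helper name `first_row_is_numeric` is hypothetical, and the two tiny files are stand-ins built from a few values in the table above, not the real dataset.

```python
import csv
import os
import tempfile

def first_row_is_numeric(path):
    """Return True if every field in the first row parses as a float."""
    with open(path, newline="") as f:
        first_row = next(csv.reader(f))
    try:
        for field in first_row:
            float(field)
    except ValueError:
        return False
    return True

# Stand-in files: one without a header (target first), one with a header.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "no_header.csv")
    bad = os.path.join(tmp, "with_header.csv")
    with open(good, "w", newline="") as f:
        csv.writer(f).writerows([[24.0, 0.00632, 18.0], [21.6, 0.02731, 0.0]])
    with open(bad, "w", newline="") as f:
        csv.writer(f).writerows([["medv", "crim", "zn"], [24.0, 0.00632, 18.0]])
    good_ok = first_row_is_numeric(good)
    bad_ok = first_row_is_numeric(bad)

print(good_ok, bad_ok)  # True False
```

The same function could be pointed at `training_dataset.csv` and `validation_dataset.csv` once they have been written.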


In [15]:
# 7. We now need to upload these two files to S3. We could use any bucket, and
# here we'll use the default bucket conveniently created by SageMaker in the region
# we're running in. We can find its name with the
# sagemaker.Session.default_bucket() API:
import sagemaker
sess = sagemaker.Session()
bucket = sess.default_bucket()


In [16]:
# 8. Finally, we use the sagemaker.Session.upload_data() API to upload the
# two CSV files to the default bucket. Here, the training and validation datasets are
# made of a single file each, but we could upload multiple files if needed. For this
# reason, we must upload the datasets under different S3 prefixes, so that their files
# won't be mixed up:
prefix = 'boston-housing'
training_data_path = sess.upload_data(path='training_dataset.csv',key_prefix=prefix + '/input/training')
validation_data_path = sess.upload_data(path='validation_dataset.csv',key_prefix=prefix + '/input/validation')

print(training_data_path)
print(validation_data_path)



s3://sagemaker-us-east-1-562547773519/boston-housing/input/training/training_dataset.csv
s3://sagemaker-us-east-1-562547773519/boston-housing/input/validation/validation_dataset.csv
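
The paths printed above follow the pattern `s3://<bucket>/<key_prefix>/<file name>`. As a rough illustration, the sketch below rebuilds such a URI locally without calling AWS. The function `expected_s3_uri` and the bucket name `example-bucket` are placeholders introduced for this example only.

```python
import posixpath

def expected_s3_uri(bucket, key_prefix, local_path):
    """Build the S3 URI a file would get under a given bucket and prefix."""
    # Only the file name is kept; local directories are not part of the key.
    filename = posixpath.basename(local_path.replace("\\", "/"))
    return "s3://{}/{}".format(bucket, posixpath.join(key_prefix, filename))

prefix = "boston-housing"
bucket = "example-bucket"  # hypothetical bucket name
training_uri = expected_s3_uri(bucket, prefix + "/input/training",
                               "training_dataset.csv")
validation_uri = expected_s3_uri(bucket, prefix + "/input/validation",
                                 "validation_dataset.csv")
print(training_uri)
print(validation_uri)
```

Keeping the two datasets under separate prefixes means each prefix can later be handed to its own channel, even if more files are added to it.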


## Configuring a training job

1. Remember that SageMaker algorithms are packaged in
Docker containers. Using boto3 and the image_uris.retrieve() API, we
can easily find the URI of the Linear Learner container image for the region
we're running in:


In [None]:
import boto3
from sagemaker import image_uris
region = boto3.Session().region_name
container = image_uris.retrieve('linear-learner', region)

2. Now that we know the name of the container, we can configure our training job
with the Estimator object. In addition to the container name, we also pass the
IAM role that SageMaker instances will use, the instance type and instance count
to use for training, as well as the output location for the model. Estimator will
generate a training job name automatically, and we could also set our own prefix with the
base_job_name parameter:

In [None]:

from sagemaker.estimator import Estimator
ll_estimator = Estimator(container,role=sagemaker.get_execution_role(),
                         instance_count=1,
                         instance_type='ml.m5.large',
                         output_path='s3://{}/{}/output'.format(bucket,prefix))


SageMaker supports plenty of different instance types, with some differences across
AWS regions. You can find the full list at https://docs.aws.amazon.com/sagemaker/latest/dg/instance-types-az.html.
Which one should we use here? Looking at the Linear Learner documentation
(https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner.html#ll-instances), we see that you can train the Linear Learner
algorithm on single- or multi-machine CPU and GPU instances. Here, we're working
with a tiny dataset, so let's select the smallest training instance available in our
region: ml.m5.large.
Checking the pricing page (https://aws.amazon.com/sagemaker/pricing/), we see that
this instance costs $0.15 per hour in the eu-west-1 region. Prices differ between
regions and change over time, so check the page for the region you're running in.
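
To put the hourly price in perspective, here is a back-of-the-envelope estimate. It assumes the cost scales linearly with billable seconds and instance count; the 120-second duration is a made-up figure for illustration, not a measured training time, and `training_cost` is a hypothetical helper.

```python
def training_cost(hourly_price, billable_seconds, instance_count=1):
    """Estimate the cost of a training job from its hourly instance price."""
    return hourly_price * instance_count * billable_seconds / 3600

# $0.15 per hour is the price quoted above; 120 seconds is an assumed duration.
cost = training_cost(0.15, billable_seconds=120)
print(f"${cost:.4f}")  # $0.0050
```

At this scale, the instance choice matters far less for cost than it would for a large dataset that trains for hours.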


3. Next, we have to set hyperparameters. This step is possibly one of the most obscure
and most difficult parts of any machine learning project. Here's my tried and
tested advice: read the algorithm documentation, stick to mandatory parameters
only unless you really know what you're doing, and quickly check optional
parameters for default values that could clash with your dataset.
Let's look at the documentation, and see which hyperparameters are mandatory
(https://docs.aws.amazon.com/sagemaker/latest/dg/ll_hyperparameters.html). As it turns out, there is only one: predictor_type.
It defines the type of problem that Linear Learner is training on (regression,
binary classification, or multiclass classification).
Taking a deeper look, we see that the default value for mini_batch_size is 1000:
this isn't going to work well with our 506-sample dataset, so let's set it to 32. We also
learn that the normalize_data parameter is set to true by default, which makes
it unnecessary to normalize data ourselves:


In [None]:
ll_estimator.set_hyperparameters(predictor_type='regressor',mini_batch_size=32)
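
The reasoning behind the smaller batch size comes down to simple arithmetic, sketched below. It assumes roughly 455 training samples (about 90% of the 506 rows) and only counts batches; it says nothing about how Linear Learner processes them internally. The helper `batches_per_epoch` is illustrative.

```python
import math

def batches_per_epoch(n_samples, mini_batch_size):
    """Number of mini-batches needed to go through the dataset once."""
    return math.ceil(n_samples / mini_batch_size)

n_training_samples = 455  # assumed: about 90% of the 506 rows
for size in (1000, 32):
    exceeds_dataset = size > n_training_samples
    print(size, batches_per_epoch(n_training_samples, size), exceeds_dataset)
# 1000 1 True
# 32 15 False
```

With the default of 1000, a single batch would be larger than the whole training set, whereas 32 gives the algorithm 15 updates per pass over the data.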


In [None]:
"""4. Now, let's define the data channels: a channel is a named source of data passed to
a SageMaker estimator. All built-in algorithms need at least a train channel, and
many also accept additional channels for validation and testing. Here, we have two
channels, which both provide data in CSV format. The TrainingInput() API
lets us define their location, their format, whether they are compressed, and so on:"""

training_data_channel = sagemaker.TrainingInput(s3_data=training_data_path,
                                                content_type='text/csv')

validation_data_channel = sagemaker.TrainingInput(s3_data=validation_data_path,
                                                  content_type='text/csv')


**By default, data served by a channel will be fully copied to each training instance,
which is fine for small datasets. We'll study alternatives in Chapter 10, Advanced
Training Techniques.**