# EMR Notebook SageMaker Custom Abalone Ring Estimator

1. [Setup](#Setup)
2. [Load the Data](#Load-the-Data)
3. [Train the Model](#Train-the-Model)
4. [Inference Results](#Inference-Results)
5. [Wrap-Up](#Wrap-Up)

## Setup

**You MUST specify the user-specific parameters below.**

**Enter the SageMaker Execution Role ARN that you created earlier in the IAM console.**

**Enter the region code corresponding to the region your EMR cluster is in.**
You can look up the region code (for example, `us-east-1` for US East (N. Virginia)) on [this page](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.RegionsAndAvailabilityZones.html).

In [None]:
# *****DEFINE USER SPECIFIC PARAMETERS******
region = ''
sagemaker_execution_role = ''

# The number of EMR nodes to process the data.
num_workers = 12

# The location of the dataset we will be using. 
source_bucket = 'ee-assets-prod-us-east-1'
source_key = 'modules/8560b4bd6942403b9fe3291928df2453/v1/data/abalone.csv'

if (region and source_bucket and sagemaker_execution_role and num_workers):
    print('All necessary user parameters are entered.')
else:
    print('Please check that you entered all of the user-specific parameters!')
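
The check above only reports that something is missing, not which value. The following is a minimal sketch of a more specific check. The helper names `find_missing_parameters` and `looks_like_role_arn` are hypothetical and not part of the lab, and the regular expression only assumes the usual IAM role ARN shape (`arn:<partition>:iam::<12-digit account ID>:role/<role name>`); it does not confirm that the role exists.

```python
import re

# Hypothetical helper (not part of the lab): list the parameters that are still empty.
def find_missing_parameters(params):
    """Return the names of parameters whose values are empty or zero."""
    return [name for name, value in params.items() if not value]

# Assumed shape of an IAM role ARN; this checks the format only, not that the role exists.
ROLE_ARN_PATTERN = re.compile(r'^arn:aws[a-z-]*:iam::\d{12}:role/.+$')

def looks_like_role_arn(arn):
    """Return True if the string has the general shape of an IAM role ARN."""
    return bool(ROLE_ARN_PATTERN.match(arn))

# Example values for illustration only.
example_params = {
    'region': 'us-east-1',
    'sagemaker_execution_role': '',
    'num_workers': 12,
}

print('Missing parameters:', find_missing_parameters(example_params))
print(looks_like_role_arn('arn:aws:iam::123456789012:role/ExampleSageMakerRole'))
```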

Each EMR notebook is launched with its own Spark context (the variable `sc`), which is the entry point for communication with Spark. EMR notebooks come with a default set of libraries for data processing. Before installing the additional Python packages used throughout this notebook, you can see which libraries are already installed by calling the Spark context's `list_packages()` function.

In [None]:
sc.list_packages()

To communicate with SageMaker you need to install notebook-scoped libraries. These libraries are available only during the notebook session. After the session ends, the libraries are deleted.

We install [boto3 (the AWS Python 3 SDK)](https://aws.amazon.com/sdk-for-python/) and the [high level SageMaker SDK](https://sagemaker.readthedocs.io/en/stable/). 

In [None]:
sc.install_pypi_package("boto3==1.16.9");
sc.install_pypi_package('sagemaker==2.16.1');

In [None]:
import boto3
import sagemaker

# We initiate a session for the boto3 and sagemaker APIs. The session includes information necessary to call the
# AWS APIs, such as AWS credentials and default AWS region. For this lab we will leverage the IAM role attached to
# the EMR notebook, so we only need to provide a region.
boto_sess = boto3.Session(region_name=region)
sage_sdk_session = sagemaker.Session(boto_session=boto_sess)
bucket = sage_sdk_session.default_bucket()

print('A SageMaker session was initiated! You are using {} as your S3 bucket for intermediate files.'.format(bucket))

## Load the Data

We will use the public abalone data set from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Abalone)
to train and test a regression model.

   The dataset provides each attribute's name, data type, measurement unit, and a
   brief description. The number of rings is the value to predict, either
   as a continuous value or as a classification problem.
   
   The age of an abalone is the number of rings in the shell + 1.5 years. Without a model, researchers must cut through the abalone shell
   and use a microscope to count the rings. Using a model to predict rings eliminates this time-consuming process.

	Name			Data Type		Meas.	Description
	----			---------		-----	-----------
	Rings			integer					+1.5 gives the age in years
	Length			continuous		mm		Longest shell measurement
	Diameter		continuous		mm		perpendicular to length
	Height			continuous		mm		with meat in shell
	Whole weight	continuous		grams	whole abalone
	Shucked weight	continuous		grams	weight of meat
	Viscera weight	continuous		grams	gut weight (after bleeding)
	Shell weight	continuous		grams	after being dried
	Male			integer			1/0 	1 encodes true, 0 false
	Female			integer			1/0 	1 encodes true, 0 false
	Infant			integer			1/0 	1 encodes true, 0 false
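
As a small worked example of the age relationship described above, the sketch below converts a ring count (observed or predicted) to an estimated age. The function name `rings_to_age` is hypothetical and is not used elsewhere in this notebook.

```python
# Offset stated in the dataset description: age = rings + 1.5 years.
AGE_OFFSET_YEARS = 1.5

def rings_to_age(rings):
    """Convert a ring count (observed or predicted) to an estimated age in years."""
    return rings + AGE_OFFSET_YEARS

print(rings_to_age(10))  # 11.5
```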

First, we need to copy the public files to the S3 bucket in our account.

In [None]:
s3 = boto3.client('s3', region_name=region)

# Local bucket S3 prefix to store the data under.
local_key = 'data/abalone.csv'

s3.copy(CopySource={'Bucket' : source_bucket,
                    'Key' : source_key}, 
        Bucket=bucket, 
        Key=local_key)

In [None]:
# Read the dataset from S3 into a Spark dataframe.
abalone_data = spark.read.load(f's3://{bucket}/{local_key}', format='csv', inferSchema=True, header=True).repartition(num_workers)
abalone_data.show(n=5)

Now that the data is in Spark we can modify and enhance our data. As an example, including all four abalone weights may be unnecessary. What really matters may be the difference between the whole weight and the shell weight. Making such changes on large datasets can be done easily in Spark.

Let's try adding a column that is the difference between whole weight and shell weight. Then remove the whole, shucked, viscera, and shell weight columns.

In [None]:
abalone_data = abalone_data.withColumn('Difference_weight', abalone_data.Whole_weight - abalone_data.Shell_weight)
abalone_data = abalone_data.drop('Whole_weight', 'Shucked_weight', 'Viscera_weight', 'Shell_weight')
abalone_data.show(n=5)

In [None]:
# Split the dataframe into training and test data.
# The training data will be used to fit our model.
# The test data will be used to measure the model's accuracy.
train_data, test_data = abalone_data.randomSplit([.75,.25])

s3_train = f's3://{bucket}/train/'
s3_test = f's3://{bucket}/test/'
data_format = 'csv'

# Save the data to S3 for training by SageMaker
train_data.write.save(s3_train, format=data_format, mode='overwrite')
test_data.write.save(s3_test, format=data_format, mode='overwrite')

print(f'Training dataset saved in {data_format} format to {s3_train}!')
print(f'Testing dataset saved in {data_format} format to {s3_test}!')
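
The split above is random, so the training and test sets will only be approximately 75% and 25% of the rows and will differ from run to run. The plain-Python sketch below illustrates the idea of assigning each row independently to a split by weight. It is an illustration under that assumption, not Spark's implementation, and the function name `random_split` is hypothetical.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row independently to a split with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative boundaries, e.g. [0.75, 1.0] for weights [.75, .25].
    boundaries = []
    running = 0.0
    for weight in weights:
        running += weight / total
        boundaries.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, boundary in enumerate(boundaries):
            if draw < boundary:
                splits[index].append(row)
                break
        else:
            # Guard against floating-point rounding in the last boundary.
            splits[-1].append(row)
    return splits

rows = list(range(1000))
train_rows, test_rows = random_split(rows, [.75, .25], seed=42)
# The sizes are close to 750 and 250, but not exact.
print(len(train_rows), len(test_rows))
```

In Spark, `randomSplit` also accepts an optional `seed` argument, for example `abalone_data.randomSplit([.75, .25], seed=42)`, which should make the split repeatable as long as the input data and its partitioning stay the same.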

## Train the Model
SageMaker contains several common built-in algorithms. For this lab you have the choice of using either the [LinearLearner](https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner.html) or [XGBoost](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) built-in algorithms. Both are regression models that estimate the number of rings on the abalone.

In [None]:
# To use the LinearLearner algorithm instead, comment out the xgboost line and uncomment the linear-learner line.
model = 'xgboost'
#model = 'linear-learner'

print('The SageMaker {} model will be used.'.format(model))

The following cell defines the hyperparameters for each algorithm. You may leave them at their defaults, but if you are interested you could try changing a few to see whether that improves model performance.

In [None]:
# Set the regularization weights. Increasing these will reduce how closely the model fits to the training data.
l1 = .25
l2 = .25

# Hyperparameters for XGBoost algorithm
xgboost_params = {
    'num_round':100,
    'objective': 'reg:linear',
    'alpha': l1,
    'lambda': l2
}

# Hyperparameters for LinearLearner algorithm
linear_params = {
    'feature_dim':len(abalone_data.columns)-1,
    'predictor_type': 'regressor',
    'loss': 'squared_loss',
    'l1': l1,
    'wd': l2
}

hyperparams = {
    'linear-learner': linear_params,
    'xgboost': xgboost_params
}

print('All model parameters have been set!')
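
To give some intuition for the `l1` and `l2` regularization weights set above, the sketch below computes a generic penalty that grows with the size of a model's weights. This is a rough illustration only: it is not the exact objective used by XGBoost or LinearLearner, scaling conventions differ between implementations, and the function name `regularization_penalty` is hypothetical.

```python
def regularization_penalty(weights, l1, l2):
    """Generic penalty: the L1 term uses absolute weights, the L2 term uses squared weights."""
    l1_term = l1 * sum(abs(w) for w in weights)
    l2_term = l2 * sum(w * w for w in weights)
    return l1_term + l2_term

# Example weight vectors for illustration only.
small_weights = [0.1, -0.2, 0.05]
large_weights = [1.0, -2.0, 0.5]

# Larger weights incur a larger penalty, which discourages fitting the training data too closely.
print(regularization_penalty(small_weights, l1=.25, l2=.25))
print(regularization_penalty(large_weights, l1=.25, l2=.25))
```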

In [None]:
from sagemaker.image_uris import retrieve

estimator = sagemaker.estimator.Estimator(
    image_uri=retrieve(framework=model, region=region, version='latest', py_version='py3'), 
    role=sagemaker_execution_role, 
    instance_count=1, 
    instance_type='ml.m5.large',
    sagemaker_session=sage_sdk_session, 
    hyperparameters=hyperparams[model]
)

print('The SageMaker model was constructed with parameters: {}.'.format(estimator.hyperparameters()))

Now that we have initialized the model, we can train it by calling the fit() function. After fit() is called, SageMaker creates a training instance, trains a model on the instance, saves the model artifacts to S3, then takes down the training instance.

This usually takes about 3 minutes. 

(**Optional**) While you wait, you may check the model training progress through the SageMaker console by following these instructions:  
a.	Open SageMaker console in AWS.  
b.	On the left panel, scroll until you see ‘training jobs’ beneath the ‘Training’ section.  
c.	Click into the job to examine further details; wait until you see the status change to ‘Completed’.


In [None]:
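from sagemaker.inputs import TrainingInput

# Spark saved the training set as multiple part-* files. The 'part' prefix selects only those files.
train_channel = TrainingInput(s3_train + 'part', content_type='text/csv')
estimator.fit({'train': train_channel})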
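
The training channel points at the `train/` location plus the `part` prefix because Spark writes its output as several `part-*` files alongside a `_SUCCESS` marker file, and an S3 prefix matches every object whose key starts with it. The sketch below illustrates that prefix matching in plain Python. The object keys are hypothetical examples modeled on Spark's naming, and `keys_with_prefix` is an illustrative helper, not an AWS API.

```python
# Hypothetical object keys, modeled on how Spark names its output files.
example_keys = [
    'train/_SUCCESS',
    'train/part-00000-example.csv',
    'train/part-00001-example.csv',
]

def keys_with_prefix(keys, prefix):
    """Return the keys that begin with the given prefix."""
    return [key for key in keys if key.startswith(prefix)]

# Only the data files match; the _SUCCESS marker is excluded.
print(keys_with_prefix(example_keys, 'train/part'))
```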

## Inference Results

How well did our model perform? Let's see how it does on the test data set we saved to S3 earlier. We'll use [SageMaker batch transform](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html) to run our test data set through our model. Batch transform creates a SageMaker instance, deploys the model, runs the dataset through the model, then takes down the instance. 

In [None]:
# Build the output location directly so that it does not depend on the contents of the bucket name.
s3_inference = f's3://{bucket}/inference/'

transformer = estimator.transformer(
    instance_count = 1,
    instance_type = 'ml.m5.large',
    strategy = 'MultiRecord',
    output_path = s3_inference,
    assemble_with= 'Line',
    accept=('text/'+data_format)
)

print('SageMaker batch transform initialized with the following parameters:')
for key in transformer.__dict__:
    print('{}:{}'.format(key, transformer.__dict__[key]))

The transform() function initiates the SageMaker batch transform job. SageMaker will create an inference instance, run the specified test set through the model, save the results to S3, and take down the inference instance. Batch transform is a great option if you require inference for large datasets and don't need sub-second response time.

This usually takes about 3 minutes.

(**Optional**) While you wait, you may check the batch transform progress through the SageMaker console by following these instructions:  
a.	Open SageMaker console in AWS.  
b.	On the left panel, scroll until you see ‘Batch transform jobs’ beneath the ‘Inference’ section.  
c.	Click into the job to examine further details; wait until you see the status change to ‘Completed’.

In [None]:
# The test data set still contains the "Rings" column the model tries to predict. 
# We do not want to send this column to the model, though. We use the SageMaker
# input_filter to filter out that column before sending to the model. We then
# join the model output with the input so we can compare the actual Rings count
# to the predicted count.
transformer.transform(
    data=s3_test,
    content_type='text/csv',
    split_type='Line',
    input_filter='$[1:]',
    join_source='Input',
    wait=True
)
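
To make the effect of `input_filter` and `join_source` concrete, the sketch below mimics both steps for a single CSV record in plain Python. It is an illustration of the behavior described in the comments above, not SageMaker's implementation. The names `filter_and_join` and `dummy_model` and the sample record are hypothetical.

```python
def filter_and_join(csv_line, predict):
    """Mimic input_filter='$[1:]' followed by join_source='Input' for one CSV record."""
    fields = csv_line.split(',')
    # '$[1:]' keeps every column except the first one (Rings).
    features = fields[1:]
    prediction = predict(features)
    # Joining with the input appends the prediction to the original record.
    return ','.join(fields + [str(prediction)])

def dummy_model(features):
    """Stand-in for the trained model; always returns the same value."""
    return 9.5

# The first field is the actual Rings value; the rest are features.
print(filter_and_join('10,0.455,0.365,0.095', dummy_model))
```

The joined record keeps the actual Rings value in the first column and the model's estimate in the last column, which is the layout the analysis in the following cells relies on.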

SageMaker batch transform completed and saved the model inference results to S3. Now let's pull the results into Spark for analysis.

In [None]:
from copy import deepcopy
from pyspark.sql.types import FloatType

# Read the schema from the initial dataset so you can apply it to the inference data.
schema = deepcopy(abalone_data.schema)
schema.add("Estimated_rings", FloatType())

# Pull down the inference data from S3
inference_data = spark.read.load(s3_inference, format=data_format, schema=schema).repartition(num_workers)
inference_data.show(n=5)

Now that we have our results, we need to quantify our model's performance. We will use root mean square error (RMSE) to measure how close Estimated_rings is to the actual Rings value.

RMSE is a popular way to measure how closely a regression model predicts a response. A lower RMSE indicates a closer prediction.

Here is the equation for RMSE:

\begin{equation*}
RMSE = \sqrt{\frac{\sum_{i=1}^{N} (\hat{y}_i - y_i)^2}{N}}
\end{equation*}

where $\hat{y}_i$ is the predicted number of rings for row $i$, $y_i$ is the observed number of rings, and $N$ is the number of rows in the test data set.
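
As a worked example of the formula, the plain-Python sketch below computes RMSE for a handful of made-up values. The function name `rmse` and the sample numbers are for illustration only and are not taken from the abalone data.

```python
import math

def rmse(predicted, observed):
    """Root mean square error between two equal-length sequences of numbers."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError('Inputs must be non-empty and of equal length.')
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up values: every prediction is off by exactly 3 rings.
example_predicted = [13.0, 7.0, 13.0, 7.0]
example_observed = [10, 10, 10, 10]

print(rmse(example_predicted, example_observed))  # 3.0
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.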

We'll use Spark SQL to run a SQL query on our data to calculate the RMSE.

In [None]:
rings = inference_data.schema.names[0]
predicted_rings = inference_data.schema.names[-1]
table_name = 'inference'

# registerTempTable is deprecated; createOrReplaceTempView is its replacement.
inference_data.createOrReplaceTempView(table_name)
sql_rmse = 'SELECT SQRT(AVG(POWER({}-{}, 2))) AS RMSE FROM {}'.format(rings, predicted_rings, table_name)

rmse_results = spark.sql(sql_rmse)
rmse_results.show()

## Wrap-Up
Congratulations! You processed data in Apache Spark on EMR and trained and deployed a machine learning model in Amazon SageMaker! Feel free to try different combinations of models and hyperparameters to see if you can reduce your model's RMSE.