# Train, tune, and deploy a custom ML model using the SmartDescriptions algorithm from AWS Marketplace


SmartDescriptions is a data-to-text solution that allows you to generate text from structured data. With the SmartDescriptions solution you can save time by generating thousands of comprehensible texts automatically.

Large companies whose business is built on sales constantly face the challenge of describing hundreds of products and services. With this solution you can create product descriptions at the push of a button.

You can fine-tune the SmartDescriptions solution so that it generates texts suited to your business domain.


This sample notebook shows you how to train a custom ML model using [SmartDescriptions](https://aws.amazon.com/marketplace/management/ml-products/a2b91337-b40d-4eb3-a915-53c42f01ccea?) from AWS Marketplace.

This algorithm was developed by adapting and fine-tuning a Hugging Face model.

## Pre-requisites
1. **Note**: This notebook contains elements that render correctly in the Jupyter interface. Open this notebook from an Amazon SageMaker Notebook Instance or Amazon SageMaker Studio.
1. Ensure that the IAM role used has the **AmazonSageMakerFullAccess** policy attached.
1. Some hands-on experience using [Amazon SageMaker](https://aws.amazon.com/sagemaker/).
1. To use this algorithm successfully, ensure that:
    1. Either your IAM role has these three permissions and you have authority to make AWS Marketplace subscriptions in the AWS account used: 
        1. **aws-marketplace:ViewSubscriptions**
        1. **aws-marketplace:Unsubscribe**
        1. **aws-marketplace:Subscribe**  
    2. or your AWS account has a subscription to [SmartDescriptions](https://aws.amazon.com/marketplace/management/ml-products/a2b91337-b40d-4eb3-a915-53c42f01ccea?)


## Contents
1. [Subscribe to the algorithm](#1.-Subscribe-to-the-algorithm)
1. [Prepare dataset](#2.-Prepare-dataset)
	1. [Dataset format expected by the algorithm](#A.-Dataset-format-expected-by-the-algorithm)
	1. [Configure and visualize train and test dataset](#B.-Configure-and-visualize-train-and-test-dataset)
	1. [Upload datasets to Amazon S3](#C.-Upload-datasets-to-Amazon-S3)
1. [Train a machine learning model](#3:-Train-a-machine-learning-model)
	1. [Set up environment](#3.1-Set-up-environment)
	1. [Train a model](#3.2-Train-a-model)
1. [Deploy model and verify results](#4:-Deploy-model-and-verify-results)
    1. [Deploy trained model](#A.-Deploy-trained-model)
    1. [Create input payload](#B.-Create-input-payload)
    1. [Perform real-time inference](#C.-Perform-real-time-inference)
    1. [Visualize output](#D.-Visualize-output)
    1. [Delete the endpoint](#E.-Delete-the-endpoint)
1. [Perform Batch inference](#5.-Perform-Batch-inference)
1. [Clean-up](#6.-Clean-up)
	1. [Delete the model](#A.-Delete-the-model)


## Usage instructions
You can run this notebook one cell at a time (press Shift+Enter to run a cell).

## 1. Subscribe to the algorithm

To subscribe to the algorithm:
1. Open the [SmartDescriptions](https://aws.amazon.com/marketplace/management/ml-products/a2b91337-b40d-4eb3-a915-53c42f01ccea?) algorithm listing page.
1. On the AWS Marketplace listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the terms and click **Accept Offer** if you agree with the EULA, pricing, and support terms.
1. After you click the **Continue to configuration** button and choose a **region**, you will see a **Product Arn**. This is the algorithm ARN that you need to specify when training a custom ML model. Copy the ARN corresponding to your region and specify it in the following cell.

In [2]:
algo_arn = "<AlgorithmARN>" # Replace this with your algorithm ARN 
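Before starting a training job, it can help to confirm that the placeholder was really replaced by an ARN for your region. The following is a minimal sketch, assuming the standard `arn:partition:service:region:account-id:resource` layout; the helper name `parse_algorithm_arn` and the example ARN are made up for illustration and are not part of the SmartDescriptions listing.

```python
# Sketch: sanity-check an algorithm ARN before using it.
# The example ARN below is hypothetical.

def parse_algorithm_arn(arn):
    """Return (region, resource) for a SageMaker algorithm ARN.

    Raises ValueError if the value does not look like one.
    """
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "sagemaker":
        raise ValueError("Not a SageMaker ARN: {!r}".format(arn))
    region, resource = parts[3], parts[5]
    if not resource.startswith("algorithm/"):
        raise ValueError("ARN does not point to an algorithm: {!r}".format(arn))
    return region, resource


example_arn = "arn:aws:sagemaker:us-east-1:123456789012:algorithm/example-algorithm"
region, resource = parse_algorithm_arn(example_arn)
print(region)
print(resource)
```

In the notebook you could call `parse_algorithm_arn(algo_arn)` and compare the returned region with the region of your SageMaker session.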

## 2. Prepare dataset

In [3]:
import sagemaker as sage
import os
import json
import pandas as pd
import boto3
from sagemaker import get_execution_role
from sagemaker import AlgorithmEstimator

### A. Dataset format expected by the algorithm

This algorithm requires a **JSON Lines** file in which each line contains an object with the attributes "data" and "response", in that order. The "data" attribute holds your structured data, with fields separated by ";", and the "response" attribute holds the text you want generated from that structured data.

The JSON line file should look like this:

In [4]:
{"data": "<name> name = [ 1 Hotel South Beach ] ; <address> address = [ 2341 Collins Ave, Miami Beach, FL 33139, USA ] ; <feelsHotel> feelsHotel = [ luxury ] ; <hasBarOnsite> hasBarOnsite = [ yes ] ; <hasDeskInRooms> hasDeskInRooms = [ yes ] ; <hasBalconyInRooms> hasBalconyInRooms = [ yes ] ; <hasRoomsUpgraded> hasRoomsUpgraded = [ yes ] ; <hasKitchenInRoom> hasKitchenInRoom = [ yes ]", "response": "A good choice is 1 Hotel South Beach in Miami Beach. It's luxurious, with a bar onsite. The upgraded rooms are full featured, including a kitchen and a desk for work. Each room also has a balcony."}
{"data": "<name> name = [ Ala Moana Hotel ] ; <address> address = [ 410 Atkinson Drive, Honolulu, HI 96814, USA ] ; <feelsHotel> feelsHotel = [ casual ] ; <hasConventionCenter> hasConventionCenter = [ yes ] ; <hasOnsiteCafe> hasOnsiteCafe = [ yes ] ; <hasCoffeeInRooms> hasCoffeeInRooms = [ yes ] ; <hasMinifridgeInRooms> hasMinifridgeInRooms = [ yes ] ; <hasBalconyInRooms> hasBalconyInRooms = [ yes ] ; <hasMicrowaveInRooms> hasMicrowaveInRooms = [ yes ]", "response": "It would be good idea to consider Ala Moana Hotel 410 Atkinson Drive, Honolulu, HI 96814, USA. It's casual, with an onsite cafe and a convention center. This hotel has microwave, mini fridge , coffee in rooms and balcony."}

{'data': '<name> name = [ Ala Moana Hotel ] ; <address> address = [ 410 Atkinson Drive, Honolulu, HI 96814, USA ] ; <feelsHotel> feelsHotel = [ casual ] ; <hasConventionCenter> hasConventionCenter = [ yes ] ; <hasOnsiteCafe> hasOnsiteCafe = [ yes ] ; <hasCoffeeInRooms> hasCoffeeInRooms = [ yes ] ; <hasMinifridgeInRooms> hasMinifridgeInRooms = [ yes ] ; <hasBalconyInRooms> hasBalconyInRooms = [ yes ] ; <hasMicrowaveInRooms> hasMicrowaveInRooms = [ yes ]',
 'response': "It would be good idea to consider Ala Moana Hotel 410 Atkinson Drive, Honolulu, HI 96814, USA. It's casual, with an onsite cafe and a convention center. This hotel has microwave, mini fridge , coffee in rooms and balcony."}

**Note:** Name your data file with the `.json` extension.

You can find more information about the dataset format in the **Usage Information** section of the [SmartDescriptions](https://aws.amazon.com/marketplace/management/ml-products/a2b91337-b40d-4eb3-a915-53c42f01ccea?) listing.
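If your structured data starts out as key/value pairs, one way to produce records in the layout shown above is sketched here. The helper names `build_data_string` and `build_record`, the reduced set of hotel attributes, and the response text are illustrative assumptions; the `<key> key = [ value ]` pattern is copied from the sample records.

```python
import json


def build_data_string(attributes):
    """Render attributes as '<key> key = [ value ]' fields joined by ' ; '."""
    return " ; ".join(
        "<{0}> {0} = [ {1} ]".format(key, value)
        for key, value in attributes.items()
    )


def build_record(attributes, response):
    """Return one JSON Lines record with 'data' first and 'response' second."""
    record = {"data": build_data_string(attributes), "response": response}
    return json.dumps(record)


# Hypothetical input: a shortened version of one of the sample hotels.
hotel = {"name": "Ala Moana Hotel", "feelsHotel": "casual", "hasOnsiteCafe": "yes"}
line = build_record(hotel, "Ala Moana Hotel is casual, with an onsite cafe.")
print(line)
```

Writing one such line per training example, each followed by a newline, yields a file in the JSON Lines shape described in this section.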

### B. Configure and visualize train and test dataset

Upload the training dataset to the `data/train` directory and update the value of `training_file_name` in the following cell. **If you intend to download the dataset at run time, add the relevant code to the following cell.**

In [5]:
training_file_name = "<FileName.json>"
training_dataset = "./data/train/{}".format(training_file_name)

In [None]:
training_dataset

In [7]:
data = []
with open(training_dataset) as f:
    for line in f:
        data.append(json.loads(line))
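Once the file is loaded, a quick check that every record carries both attributes from section A can catch formatting problems before a training job is started. This is only a sketch: the function name `find_invalid_records` and the two sample records are assumptions made for illustration.

```python
def find_invalid_records(records):
    """Return (index, problem) pairs for records that lack a usable
    'data' or 'response' string."""
    problems = []
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append((index, "not a JSON object"))
            continue
        for key in ("data", "response"):
            value = record.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((index, "missing or empty '{}'".format(key)))
    return problems


# Hypothetical records: the second one has no 'response'.
sample = [
    {"data": "<name> name = [ Ala Moana Hotel ]", "response": "A casual hotel."},
    {"data": "<name> name = [ Belvedere Hotel ]"},
]
print(find_invalid_records(sample))
```

In the notebook, `find_invalid_records(data)` would run the same check on the records loaded above; an empty list means no problems were found.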

In [9]:
#Show the training data
data[:3]

[{'data': '<name> name = [ 1 Hotel South Beach ] ; <address> address = [ 2341 Collins Ave, Miami Beach, FL 33139, USA ] ; <feelsHotel> feelsHotel = [ luxury ] ; <hasBarOnsite> hasBarOnsite = [ yes ] ; <hasDeskInRooms> hasDeskInRooms = [ yes ] ; <hasBalconyInRooms> hasBalconyInRooms = [ yes ] ; <hasRoomsUpgraded> hasRoomsUpgraded = [ yes ] ; <hasKitchenInRoom> hasKitchenInRoom = [ yes ]',
  'response': "A good choice is 1 Hotel South Beach in Miami Beach. It's luxurious, with a bar onsite. The upgraded rooms are full featured, including a kitchen and a desk for work. Each room also has a balcony."},
 {'data': '<name> name = [ 1 Hotel South Beach ] ; <address> address = [ 2341 Collins Ave, Miami Beach, FL 33139, USA ] ; <feelsHotel> feelsHotel = [ luxury ] ; <hasBarOnsite> hasBarOnsite = [ yes ] ; <hasDeskInRooms> hasDeskInRooms = [ yes ] ; <hasBalconyInRooms> hasBalconyInRooms = [ yes ] ; <hasRoomsUpgraded> hasRoomsUpgraded = [ yes ] ; <hasKitchenInRoom> hasKitchenInRoom = [ yes ]',
  '

### C. Upload datasets to Amazon S3

In [None]:
sagemaker_session = sage.Session()
bucket = sagemaker_session.default_bucket()
bucket

In [None]:
training_data = sagemaker_session.upload_data(
    training_dataset, bucket=bucket, key_prefix="smart-descriptions/train"
)
training_data

## 3: Train a machine learning model

Now that the dataset is available in an accessible Amazon S3 bucket, we are ready to train a machine learning model.

### 3.1 Set up environment

In [None]:
role = get_execution_role()
role

In [13]:
output_location = "s3://{}/smart-descriptions/{}".format(
    bucket, "output"
)

### 3.2 Train a model

You can find more information about the supported hyperparameters in the **Hyperparameters** section of the [SmartDescriptions](https://aws.amazon.com/marketplace/management/ml-products/a2b91337-b40d-4eb3-a915-53c42f01ccea?) listing.

In [14]:
# Define hyperparameters
# Adjust these values to your requirements. The only hyperparameter that must not be changed is train_file.
hyperparameters = {
    'train_file':'/opt/ml/input/data/train/{}'.format(training_file_name),
    'num_train_epochs': 1,
    'per_device_train_batch_size': 8,
    'per_device_eval_batch_size': 8
}
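The `train_file` value is a path inside the training container rather than a local or S3 path. With SageMaker's `File` input mode, the data for each channel is made available under `/opt/ml/input/data/<channel_name>`. The sketch below shows how the path above is composed for the `train` channel; the helper name `channel_file_path` and the file name `hotels.json` are hypothetical.

```python
import posixpath

# Root directory where SageMaker places input channels in File mode.
SAGEMAKER_INPUT_ROOT = "/opt/ml/input/data"


def channel_file_path(channel_name, file_name):
    """Return the in-container path of a file in the given input channel."""
    return posixpath.join(SAGEMAKER_INPUT_ROOT, channel_name, file_name)


print(channel_file_path("train", "hotels.json"))
```

The channel name has to match the key used when the training job is started, which is why the directory part of `train_file` should stay as it is.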

In [None]:
# Create an estimator object for running a training job

instance_type = '<InstanceType>'  # Replace with your instance type. Supported instance types: ml.m5.large, ml.m5.xlarge, ml.m5.2xlarge, ml.g4dn.xlarge, ml.g4dn.2xlarge

estimator = sage.algorithm.AlgorithmEstimator(
    algorithm_arn=algo_arn,
    base_job_name="smart-descriptions-marketplace",
    role=role,
    instance_count=1,
    instance_type=instance_type,
    input_mode="File",
    output_path=output_location,
    sagemaker_session=sagemaker_session,
    hyperparameters=hyperparameters,
)
# Run the training job.
estimator.fit({"train": training_data})

## 4: Deploy model and verify results

Now you can deploy the model for performing real-time inference.

In [16]:
model_name = "smart-descriptions"

content_type = "application/json"

real_time_inference_instance_type = "<InstanceType>"  # Replace with your instance type. Supported instance types: ml.m5.large, ml.m5.xlarge, ml.m5.2xlarge

batch_transform_inference_instance_type = "<InstanceType>"  # Replace with your instance type. Supported instance types: ml.m5.large, ml.m5.xlarge, ml.m5.2xlarge

### A. Deploy trained model

In [17]:
predictor = estimator.deploy(
    1, real_time_inference_instance_type, serializer=sage.serializers.JSONSerializer()
)


Once the endpoint is created, you can perform real-time inference.

### B. Create input payload

In [18]:
input_file = 'input-real-time-inference.txt'
input_data = './data/inference/input/real-time/{}'.format(input_file)

In [19]:
input_data_endpoint = []
with open(input_data) as f:
    for line in f:
        input_data_endpoint.append(json.loads(line))
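The loading cell above calls `json.loads` on every line, which suggests that each line of the input file is a JSON-encoded string. Under that assumption, a file in the same shape can be produced as sketched below; the shortened records and the file name `input-example.txt` are illustrative only.

```python
import json
import os
import tempfile

# Hypothetical, shortened structured-data strings.
records = [
    "<name> name = [ 1 Hotel South Beach ] ; <feelsHotel> feelsHotel = [ luxury ]",
    "<name> name = [ Ala Moana Hotel ] ; <feelsHotel> feelsHotel = [ casual ]",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "input-example.txt")

    # Write one JSON-encoded string per line.
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # Read the file back the same way the notebook does.
    with open(path) as f:
        loaded = [json.loads(line) for line in f]

print(loaded == records)
```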

In [20]:
input_data_endpoint

['<name> name = [ 1 Hotel South Beach ] ; <address> address = [ 2341 Collins Ave, Miami Beach, FL 33139, USA ] ; <feelsHotel> feelsHotel = [ luxury ] ; <hasBarOnsite> hasBarOnsite = [ yes ] ; <hasDeskInRooms> hasDeskInRooms = [ yes ] ; <hasBalconyInRooms> hasBalconyInRooms = [ yes ] ; <hasRoomsUpgraded> hasRoomsUpgraded = [ yes ] ; <hasKitchenInRoom> hasKitchenInRoom = [ yes ]',
 '<name> name = [ Ala Moana Hotel ] ; <address> address = [ 410 Atkinson Drive, Honolulu, HI 96814, USA ] ; <feelsHotel> feelsHotel = [ casual ] ; <hasConventionCenter> hasConventionCenter = [ yes ] ; <hasOnsiteCafe> hasOnsiteCafe = [ yes ] ; <hasCoffeeInRooms> hasCoffeeInRooms = [ yes ] ; <hasMinifridgeInRooms> hasMinifridgeInRooms = [ yes ] ; <hasBalconyInRooms> hasBalconyInRooms = [ yes ] ; <hasMicrowaveInRooms> hasMicrowaveInRooms = [ yes ]',
 '<name> name = [ Belvedere Hotel ] ; <address> address = [ 1900 Boardwalk, Atlantic City, NJ 08401, USA ] ; <hasBarOnsite> hasBarOnsite = [ yes ] ; <hasRestaurant> ha

Parameters that control text generation at prediction time:
* max_length (int): The maximum length of the sequence to be generated.
* min_length (int): The minimum length of the sequence to be generated.
* length_penalty (float, optional, defaults to 1.0): Exponential penalty to the length that is used with beam-based generation. It is applied as an exponent to the sequence length, which in turn is used to divide the score of the sequence. Since the score is the log likelihood of the sequence (i.e. negative), length_penalty > 0.0 promotes longer sequences, while length_penalty < 0.0 encourages shorter sequences.
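To make the effect of `length_penalty` concrete, the sketch below applies the rule described above: the summed log likelihood of a candidate is divided by its length raised to `length_penalty`, and the candidate with the higher score is preferred. The two candidates and their log likelihoods are made-up numbers, and `beam_score` is a hypothetical helper rather than a function of the algorithm.

```python
def beam_score(sum_log_probs, length, length_penalty):
    """Score a candidate: summed log likelihood divided by length ** penalty."""
    return sum_log_probs / (length ** length_penalty)


# Hypothetical candidates: a short one and a longer, less likely one.
candidates = {
    "short": {"sum_log_probs": -4.0, "length": 10},
    "long": {"sum_log_probs": -6.0, "length": 20},
}


def preferred(length_penalty):
    """Return the name of the candidate with the highest score."""
    return max(
        candidates,
        key=lambda name: beam_score(
            candidates[name]["sum_log_probs"],
            candidates[name]["length"],
            length_penalty,
        ),
    )


for penalty in (0.0, 1.0, 3.0):
    print(penalty, preferred(penalty))
```

With a penalty of 0.0 the raw log likelihoods are compared and the short candidate is preferred; with 1.0 or 3.0 the division by length favors the longer candidate.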

In [21]:
prediction_parameters = {
    "max_length": 150,
    "min_length": 30,
    "length_penalty": 3.0
}
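The request sent to the endpoint in section C combines the input strings and the generation parameters in a single JSON object with the keys `inputs` and `parameters`. A small sketch of what that payload looks like when serialized is given here; the single shortened input string is illustrative, and the names `example_inputs`, `example_parameters`, and `payload` are assumptions.

```python
import json

# Hypothetical, shortened input string.
example_inputs = [
    "<name> name = [ 1 Hotel South Beach ] ; <feelsHotel> feelsHotel = [ luxury ]",
]
example_parameters = {"max_length": 150, "min_length": 30, "length_penalty": 3.0}

# Same top-level keys as the request built in section C.
payload = {"inputs": example_inputs, "parameters": example_parameters}

payload_json = json.dumps(payload, indent=2)
print(payload_json)
```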


### C. Perform real-time inference

In [22]:
prediction = predictor.predict({
	'inputs': input_data_endpoint,
	'parameters': prediction_parameters
})
prediction

b'[{"generated_text":"The 1 Hotel South Beach is a luxury hotel. It has a bar and a balcony. It has upgraded rooms and a kitchen."},{"generated_text":"The Ala Moana Hotel is a casual hotel that offers a conference center and a coffee maker. It has a balcony and a mini fridge."},{"generated_text":"The Belvedere Hotel is a great choice for you. It has a bar, a restaurant, a meeting room, a spa and a casino."}]'

In [23]:
output = json.loads(prediction)
output

[{'generated_text': 'The 1 Hotel South Beach is a luxury hotel. It has a bar and a balcony. It has upgraded rooms and a kitchen.'},
 {'generated_text': 'The Ala Moana Hotel is a casual hotel that offers a conference center and a coffee maker. It has a balcony and a mini fridge.'},
 {'generated_text': 'The Belvedere Hotel is a great choice for you. It has a bar, a restaurant, a meeting room, a spa and a casino.'}]
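For a quick side-by-side view, each generated text can be paired with the hotel name taken from its input string. This is a sketch only: the regular expression assumes the `<name> name = [ ... ]` field shown in the samples, and `summarize`, `example_inputs`, and `example_outputs` are hypothetical names.

```python
import re

# Matches the value inside '<name> name = [ ... ]'.
NAME_PATTERN = re.compile(r"<name> name = \[ (.*?) \]")


def summarize(inputs, outputs):
    """Pair the name found in each input with its generated text."""
    rows = []
    for source, result in zip(inputs, outputs):
        match = NAME_PATTERN.search(source)
        name = match.group(1) if match else "(unknown)"
        rows.append((name, result["generated_text"]))
    return rows


example_inputs = [
    "<name> name = [ 1 Hotel South Beach ] ; <feelsHotel> feelsHotel = [ luxury ]",
]
example_outputs = [
    {"generated_text": "The 1 Hotel South Beach is a luxury hotel. It has a bar and a balcony. It has upgraded rooms and a kitchen."},
]

for name, text in summarize(example_inputs, example_outputs):
    print("{}: {}".format(name, text))
```

In the notebook, `summarize(input_data_endpoint, output)` would apply the same pairing to the real request and response.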

In [24]:
with open('./data/inference/output/real-time/output.txt', 'w') as outfile:
    for entry in output:
        json.dump(entry, outfile)
        outfile.write('\n')

### E. Delete the endpoint

Now that you have successfully performed a real-time inference, you no longer need the endpoint. You can delete it to avoid being charged.

In [25]:
predictor.delete_endpoint(delete_endpoint_config=True)

## 5. Perform Batch inference

In this section, you will perform batch inference using multiple input payloads together.

In [27]:
s3 = boto3.resource('s3')
# Note: this rebinds `bucket` from the bucket name (str) to a boto3 Bucket resource.
bucket = s3.Bucket(bucket)

In [28]:
# upload the batch-transform job input files to S3
transform_input_folder = "data/inference/input/batch" 
transform_input = sagemaker_session.upload_data(transform_input_folder, key_prefix=model_name + '/batch/input')
print("Transform input uploaded to " + transform_input)

Transform input uploaded to s3://sagemaker-us-east-1-544022947556/smart-descriptions/batch/input


In [None]:
# Run the batch-transform job
transformer = estimator.transformer(
    1,
    batch_transform_inference_instance_type,
    strategy='SingleRecord',
    output_path='s3://{}/{}/batch/output/'.format(bucket.name, model_name),
    assemble_with='Line',
)
transformer.transform(transform_input, content_type=content_type, split_type='Line')
transformer.wait()

In [None]:
# output is available on following path
transformer.output_path

In [31]:
# Fetch the batch-transform output object directly by its key and
# read its whole body into memory.
obj = bucket.Object('{}/batch/output/input-batch-job.txt.out'.format(model_name))
key = obj.key
body = obj.get()['Body'].read()

In [32]:
body

b'[{"generated_text":"The 1 Hotel South Beach is a luxury hotel. It has a bar and a balcony. It has upgraded rooms and a kitchen."}]\n[{"generated_text":"The Ala Moana Hotel is a casual hotel that offers a conference center and a coffee maker. It has a balcony and a mini fridge."}]\n[{"generated_text":"Belvedere Hote is a great choice for you. It has a bar, a restaurant, a meeting room, a spa and a casino."}]\n'

In [33]:
body = body.decode("utf-8")
body = body.replace(']\n[', ',')
body = body.replace(']\n', ']')
body

'[{"generated_text":"The 1 Hotel South Beach is a luxury hotel. It has a bar and a balcony. It has upgraded rooms and a kitchen."},{"generated_text":"The Ala Moana Hotel is a casual hotel that offers a conference center and a coffee maker. It has a balcony and a mini fridge."},{"generated_text":"Belvedere Hote is a great choice for you. It has a bar, a restaurant, a meeting room, a spa and a casino."}]'

In [34]:
output = json.loads(body)
output

[{'generated_text': 'The 1 Hotel South Beach is a luxury hotel. It has a bar and a balcony. It has upgraded rooms and a kitchen.'},
 {'generated_text': 'The Ala Moana Hotel is a casual hotel that offers a conference center and a coffee maker. It has a balcony and a mini fridge.'},
 {'generated_text': 'Belvedere Hote is a great choice for you. It has a bar, a restaurant, a meeting room, a spa and a casino.'}]
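The string replacement above works for this output, but it depends on the exact bracket and newline layout. An alternative is to parse the body line by line, since each line is a JSON array of its own. The sketch below uses a made-up two-line body; `parse_batch_output` is a hypothetical helper name.

```python
import json


def parse_batch_output(raw_bytes):
    """Parse a batch-transform body in which each non-empty line is a
    JSON array, and return all entries as one flat list."""
    results = []
    for line in raw_bytes.decode("utf-8").splitlines():
        if line.strip():
            results.extend(json.loads(line))
    return results


# Hypothetical body with the same layout as the output shown above.
sample_body = (
    b'[{"generated_text":"First description."}]\n'
    b'[{"generated_text":"Second description."}]\n'
)
parsed = parse_batch_output(sample_body)
print(parsed)
```

In the notebook this would be called on the raw bytes returned by `obj.get()['Body'].read()`, before they are decoded.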

In [35]:
with open('./data/inference/output/batch/output.txt', 'w') as outfile:
    for entry in output:
        json.dump(entry, outfile)
        outfile.write('\n')

## 6. Clean-up

### A. Delete the model

In [37]:
transformer.delete_model()