# Integrating ThirdAI with SageMaker

In this demo, we integrate ThirdAI with Amazon SageMaker and use SageMaker's Estimator and deployment pipeline for end-to-end model training and deployment.

We will train a ThirdAI Universal Deep Transformer (UDT) model on the Amazon Polarity dataset. Training and deployment is a three-step process:

1. Initializing an Estimator for the ThirdAI UDT container
2. Training the model on your own dataset
3. Deploying the model

In [1]:
import sagemaker
import boto3

from sagemaker.estimator import Estimator
from sagemaker.predictor import Predictor

Permissions Required:
The SageMaker notebook instance must have access to ECR and to the S3 bucket used to load and store data. To change the permissions attached to the SageMaker role, follow these steps:

1. Select your notebook from **Notebook Instances** in the SageMaker menu
2. Go to **Permissions and Encryption** and click the **IAM role ARN** (a new tab should open)
3. In the new tab, go to **Permissions** and click **Add Permissions**
4. Search for **Registry** in the **Menu**
5. Add the following policies:
    * AmazonElasticContainerRegistryPublicFullAccess 
    * AmazonEC2ContainerRegistryFullAccess
    * AmazonS3FullAccess
    * AmazonSageMakerFullAccess 
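
The console steps above can also be scripted. The following is a minimal sketch, assuming the four policies are AWS managed policies under the standard `arn:aws:iam::aws:policy/` prefix and that you pass in a boto3 IAM client. The helper names and the role name in the usage comment are hypothetical and not part of ThirdAI or SageMaker.

```python
# Managed policy names copied from the list above.
POLICY_NAMES = [
    "AmazonElasticContainerRegistryPublicFullAccess",
    "AmazonEC2ContainerRegistryFullAccess",
    "AmazonS3FullAccess",
    "AmazonSageMakerFullAccess",
]


def managed_policy_arn(policy_name):
    """Build the ARN of an AWS managed policy (assumes the default policy path)."""
    return f"arn:aws:iam::aws:policy/{policy_name}"


def attach_policies(iam_client, role_name, policy_names=POLICY_NAMES):
    """Attach each managed policy to the role and return the ARNs that were used."""
    arns = [managed_policy_arn(name) for name in policy_names]
    for arn in arns:
        iam_client.attach_role_policy(RoleName=role_name, PolicyArn=arn)
    return arns


# Example usage (requires credentials that are allowed to modify IAM roles):
#   import boto3
#   attach_policies(boto3.client("iam"), "my-sagemaker-execution-role")
```

Full-access policies are convenient for a demo; for anything longer-lived you would likely want to scope them down.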
 
 
Estimators use a private ECR repository for serving your model, so we first pull the image from the ThirdAI public repository and then push it to your private ECR registry.

Run the following bash commands to set up your private registry:


In [None]:
%%bash
# pulling the public image to your device
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/s5n2a0h0
docker pull public.ecr.aws/s5n2a0h0/thirdai_udt:latest

# getting your account name and region for pushing the docker image
image="thirdai_udt_demo"
region=$(aws configure get region)
account=$(aws sts get-caller-identity --query Account --output text)
imagename="${account}.dkr.ecr.${region}.amazonaws.com/${image}:latest"

# if the repository doesn't exist, we will create it
aws ecr describe-repositories --repository-names "${image}" > /dev/null 2>&1
if [ $? -ne 0 ]
then
    echo "Doesn't exist"
    aws ecr create-repository --repository-name "${image}" > /dev/null
fi

# tag the public image with your private image name
docker tag public.ecr.aws/s5n2a0h0/thirdai_udt:latest ${imagename}
echo $imagename

# login to your private repository (in your own region) and push the docker image
aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin ${account}.dkr.ecr.${region}.amazonaws.com
docker push ${imagename}

# imagename is your ECR image URI; use it as ecr_image in the next cell
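
If you would rather rebuild the image name inside the notebook than copy it from the bash output, the same string can be assembled in Python. This is a small sketch that mirrors the `imagename` variable in the script; the function name `ecr_image_uri` is hypothetical.

```python
def ecr_image_uri(account, region, repository, tag="latest"):
    """Compose a private ECR image URI in the same format as the bash script."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Example usage inside the notebook (requires AWS credentials):
#   import boto3
#   account = boto3.client("sts").get_caller_identity()["Account"]
#   region = boto3.Session().region_name
#   ecr_image = ecr_image_uri(account, region, "thirdai_udt_demo")
```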

In [3]:
default_bucket = "mlflow-trial-delete-later" #put your bucket name here
role=sagemaker.get_execution_role()
boto_session = boto3.Session()

sagemaker_session = sagemaker.Session(
        boto_session = boto_session,
        default_bucket = default_bucket
)

#put the imagename from the above bash script here.
ecr_image = "your_imagename"  

job_name="amazon-polarity-demo"
output_path=f"sagemaker/thirdai/demo/{job_name}"

dataset_folder="datasets/amazon_polarity"
license_folder="license_file"

Store the dataset in your S3 bucket and set train_file_path and test_file_path. You also need to provide a license_file_path. Reach out to us at contact@thirdai.com to get a license file.

Note that the license file must be named license.serialized.
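
If the files are still on local disk, one way to place them in the bucket is the boto3 S3 client's `upload_file` call. The sketch below assumes you pass in such a client; the helper name `upload_inputs` and the local paths in the usage comment are hypothetical, while the S3 keys follow the folder layout used in this notebook.

```python
def upload_inputs(s3_client, bucket, files):
    """Upload local files to S3.

    `files` maps a local path to the S3 key it should be stored under.
    Returns the resulting s3:// URIs in the same order.
    """
    uris = []
    for local_path, key in files.items():
        s3_client.upload_file(local_path, bucket, key)
        uris.append(f"s3://{bucket}/{key}")
    return uris


# Example usage (requires AWS credentials with write access to the bucket):
#   import boto3
#   upload_inputs(boto3.client("s3"), default_bucket, {
#       "train.csv": f"{dataset_folder}/train.csv",
#       "test.csv": f"{dataset_folder}/test.csv",
#       "license.serialized": f"{license_folder}/license.serialized",
#   })
```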

In [4]:
train_file_path=f"s3://{default_bucket}/{dataset_folder}/train.csv"
test_file_path=f"s3://{default_bucket}/{dataset_folder}/test.csv"
license_file_path=f"s3://{default_bucket}/{license_folder}/license.serialized"

Hyperparameters:

* Required: 
    1. epochs
    2. learning-rate
    3. train-file-name


* Optional:
    1. test-file-name : If provided, the model will be evaluated on the test set
    2. udt_args : Arguments provided to UDT. data_types is automatically inferred from the training file. 
    3. model-file-name : If provided, udt_args are ignored.

Note: At least one of udt_args or model-file-name must be provided.
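
The rules above can be checked locally before a training job is launched. This is a minimal sketch of such a check; the helper names are hypothetical, and the string format for `udt_args` (the `str()` of a Python dict) is assumed from the example value used in this notebook, not from documented container behavior.

```python
REQUIRED_KEYS = ("epochs", "learning-rate", "train-file-name")


def build_udt_args(**kwargs):
    """Render UDT arguments as a dict-style string, matching the notebook's example."""
    return str(kwargs)


def check_hyperparameters(hyperparameters):
    """Raise ValueError if the hyperparameters break the rules listed above."""
    missing = [key for key in REQUIRED_KEYS if key not in hyperparameters]
    if missing:
        raise ValueError(f"Missing required hyperparameters: {missing}")
    if "udt_args" not in hyperparameters and "model-file-name" not in hyperparameters:
        raise ValueError("Provide at least one of 'udt_args' or 'model-file-name'")
    return hyperparameters


checked = check_hyperparameters({
    "train-file-name": "train.csv",
    "epochs": 1,
    "learning-rate": 0.01,
    "udt_args": build_udt_args(target="label", n_target_classes=2, delimiter=","),
})
```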

In [5]:
hyperparameters={
    "train-file-name":"train.csv", 
    "epochs":1,
    "learning-rate":0.01,
    "udt_args":"{'target': 'label', 'n_target_classes': 2, 'delimiter': ','}"
}
data_channels={
    "train":train_file_path,
    "test": test_file_path,
    "license": license_file_path,
}

In [6]:
estimator=Estimator(
    role=role,
    sagemaker_session=sagemaker_session,
    instance_count=1,
    instance_type="ml.m4.2xlarge",
    image_uri=ecr_image,
    base_job_name=job_name,
    hyperparameters=hyperparameters,
    output_path=f"s3://{default_bucket}/{output_path}/model"
)

The model and the related artifacts are stored in your S3 bucket at the location **output_path**. SageMaker stores the model artifacts as **model.tar.gz**. To train a new estimator from an existing estimator's artifacts, specify **model_uri** in the Estimator constructor:


<code>   estimator = Estimator(
        model_uri="path_to_model_artifacts"
    )
</code>
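
To find the value to pass as **model_uri**, it helps to know where the artifacts land. The sketch below assumes SageMaker's usual layout, in which a training job writes to `<output_path>/<training job name>/output/model.tar.gz`; the function name and the job name in the example are hypothetical. After `fit()` has finished, the `estimator.model_data` attribute should report the same location, so prefer that when the estimator object is still available.

```python
def model_artifact_uri(output_path, training_job_name):
    """Guess the S3 URI of model.tar.gz for a finished training job.

    Assumes the default SageMaker layout under the estimator's output_path.
    """
    return f"{output_path.rstrip('/')}/{training_job_name}/output/model.tar.gz"


# Hypothetical job name; real names are base_job_name plus a timestamp suffix.
example_uri = model_artifact_uri(
    "s3://my-bucket/sagemaker/thirdai/demo/amazon-polarity-demo/model",
    "amazon-polarity-demo-2023-01-01-00-00-00-000",
)
```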

In [7]:
estimator.hyperparameters()

{'train-file-name': 'train.csv',
 'epochs': 1,
 'learning-rate': 0.01,
 'udt_args': "{'target': 'label', 'n_target_classes': 2, 'delimiter': ','}"}

In [None]:
estimator.fit(data_channels, wait=True)

Because the model was trained through SageMaker's Estimator pipeline, it supports all the usual SageMaker features. To deploy the model, call <code>estimator.deploy()</code>.

ThirdAI models are built to train and run inference efficiently on CPUs, so the SageMaker estimator can be deployed on **CPU**-only instances.

In [9]:
predictor = estimator.deploy(
    initial_instance_count = 1,
    instance_type = "ml.t2.medium",
)

----------!

In [10]:
from sagemaker.serializers import JSONSerializer

In [15]:
data={"title":"The product is very good"}
stdata=JSONSerializer().serialize(data)

In [16]:
predictor.predict(stdata).decode()

'1'
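
To score several reviews, the single request above can be wrapped in a small loop. This sketch assumes the endpoint keeps accepting a JSON object with a `title` field and returning bytes, as in the cells above; the helper name `classify_titles` is hypothetical.

```python
import json


def classify_titles(predictor, titles):
    """Send each title to the endpoint and return {title: predicted label}."""
    results = {}
    for title in titles:
        payload = json.dumps({"title": title})
        # The endpoint response is bytes, so decode it to a plain string.
        results[title] = predictor.predict(payload).decode()
    return results


# Example usage with the deployed predictor:
#   classify_titles(predictor, ["The product is very good", "Stopped working after a day"])
```

When you are done experimenting, calling `predictor.delete_endpoint()` removes the endpoint so it stops incurring charges.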