# Adverse event clustering pipeline

#### Note: This exercise is part of chapter 9 in the book "Applied Machine Learning for Healthcare and Lifesciences on AWS". Make sure you have completed the steps as outlined in the prerequisites section of chapter 9 to successfully complete this exercise.

In this notebook we will create a SageMaker pipeline that preprocesses the data, trains a model, and registers the model in the SageMaker Model Registry. Here are the details of each step in the pipeline:

1. The preprocessing step runs in a custom container. During preprocessing, we download the raw data, sample 100 rows from it, extract the top 5 medical conditions from them, and vectorize those conditions.
2. In the training step, we use a SageMaker scikit-learn container to train a clustering model.
3. In the final step, we register the model in the SageMaker Model Registry.
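
The sampling and "top conditions" part of the preprocessing step can be illustrated with a small standard-library sketch. It is not the book's `preprocessing.py`: the row layout (a `conditions` list per row), the helper names, and the reading of "top 5" as "most frequent across the sampled rows" are all assumptions made for illustration.

```python
import random
from collections import Counter


def sample_rows(rows, k=100, seed=0):
    """Return up to k rows chosen without replacement (seeded for repeatability)."""
    rng = random.Random(seed)
    return list(rows) if len(rows) <= k else rng.sample(rows, k)


def top_conditions(rows, n=5):
    """Return the n most frequent condition strings across the given rows."""
    counts = Counter(
        cond.strip().lower() for row in rows for cond in row["conditions"]
    )
    return [cond for cond, _ in counts.most_common(n)]


# Tiny hypothetical dataset; the real pipeline downloads its raw data.
rows = [
    {"conditions": ["Headache", "Nausea"]},
    {"conditions": ["headache", "Rash"]},
    {"conditions": ["Nausea", "headache"]},
]
sampled = sample_rows(rows, k=100)
print(top_conditions(sampled, n=2))
```

Seeding the sampler keeps repeated runs comparable, which is useful when you want to check that a change in the clusters comes from the model rather than from a different sample.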


Let's start by making sure we have the correct version of the SageMaker Python SDK installed.


In [20]:
import sys

!{sys.executable} -m pip install "sagemaker>=2.99.0"
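
If you want to confirm that the installed SDK satisfies the `>=2.99.0` requirement, a check along the following lines may help. It is a simplified sketch: the helper names are hypothetical, and the comparison only looks at the leading numeric part of each version component, so it is not a full implementation of Python's version-ordering rules.

```python
from importlib import metadata
from itertools import takewhile


def version_tuple(version):
    """Convert '2.99.0' into (2, 99, 0), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum="2.99.0"):
    """Return True if the installed version is at least the minimum version."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '2.99' and '2.99.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_version(package="sagemaker"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("sagemaker is not installed")
else:
    print(found, "OK" if meets_minimum(found) else "too old")
```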



Next, we import the required libraries.

In [1]:
import sagemaker
from sagemaker import get_execution_role
from sagemaker.session import Session
import boto3
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.workflow.pipeline_context import PipelineSession

We also set some variables we will use in this notebook later.

In [6]:
sagemaker_session=sagemaker.Session()
pipeline_session = PipelineSession()
bucket = sagemaker_session.default_bucket()

role = get_execution_role()
prefix = 'chapter9/data'

print('Training input/output will be stored in {}/{}'.format(bucket, prefix))
print('\nIAM Role: {}'.format(role))

Training input/output will be stored in sagemaker-us-east-1-485822383573/chapter9/data

IAM Role: arn:aws:iam::485822383573:role/service-role/AmazonSageMaker-ExecutionRole-20220426T122295


Let us start by examining our preprocessing script.

In [21]:
!pygmentize scripts/preprocessing.py

import csv
import wget
import zipfile
import os
import pandas as pd
import boto3
import time
import json
import argparse
from sklearn.preprocessing import Normalizer
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

parser = argparse.ArgumentParser()
parser.add_argument('

As you can see from the script, we upload 3 files to S3 at the end of preprocessing. Examine the preprocessing code to make sure you understand how we are processing the raw data.
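
The script imports `TfidfVectorizer`, so the vectorization is based on TF-IDF weighting. To make that weighting concrete, here is a standard-library sketch that uses the smoothed IDF formula, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which is the default documented for scikit-learn's `TfidfVectorizer`. It is an illustration under simplified tokenization, not a drop-in replacement for the library or a copy of the book's script; the function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep word tokens of two or more characters."""
    return re.findall(r"\w\w+", text.lower())


def tfidf_vectors(documents):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(documents)
    # Document frequency: the number of documents that contain each term.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed inverse document frequency.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocabulary}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[tok] * idf[tok] for tok in vocabulary]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm if norm else 0.0 for v in raw])
    return vocabulary, rows


# Hypothetical condition strings, one per sampled row.
docs = ["headache nausea", "headache rash", "nausea dizziness headache"]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
for row in vectors:
    print([round(v, 3) for v in row])
```

A term that appears in every document, such as `headache` here, receives the lowest IDF, so rarer conditions contribute more to the distance between rows.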

We create a Docker container to run our preprocessing step. Let us look at the details of the container by examining the Dockerfile.

In [23]:
!pygmentize scripts/Dockerfile

FROM python:3.7-slim-buster

RUN pip install pandas
RUN pip install wget
RUN pip install boto3
RUN pip install sagemaker
RUN pip install scikit-learn

# Make sure python doesn't buffer stdout so we get logs ASAP.
ENV PYTHONUNBUFFERED=TRUE
ENTRYPOINT ["python3"]


Next, we build the Docker image and push it to Amazon ECR using the shell script below.

In [9]:
%%sh

docker_name=sagemaker-preprocessing
account=$(aws sts get-caller-identity --query Account --output text)
echo $account
region=$(aws configure get region)

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${docker_name}:latest"
# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${docker_name}" > /dev/null 2>&1
if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${docker_name}" > /dev/null
fi

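# Log Docker in to ECR. get-login-password works with AWS CLI v2 and recent v1 releases; the older get-login command was removed in v2.
aws ecr get-login-password --region "${region}" | docker login --username AWS --password-stdin "${account}.dkr.ecr.${region}.amazonaws.com"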
docker build -t $docker_name -f scripts/Dockerfile .
docker tag ${docker_name} ${fullname}
docker push ${fullname}

485822383573
Login Succeeded

Step 1/8 : FROM python:3.7-slim-buster
 ---> 8fe6e55c0412
Step 2/8 : RUN pip install pandas
 ---> Using cache
 ---> ed3c2aadaa6e
Step 3/8 : RUN pip install wget
 ---> Using cache
 ---> 93dc76d1c100
Step 4/8 : RUN pip install boto3
 ---> Using cache
 ---> 43acfff1ec93
Step 5/8 : RUN pip install sagemaker
 ---> Using cache
 ---> 0a0768240618
Step 6/8 : RUN pip install scikit-learn
 ---> Using cache
 ---> 2ce6c8fc1e49
Step 7/8 : ENV PYTHONUNBUFFERED=TRUE
 ---> Using cache
 ---> f68470613295
Step 8/8 : ENTRYPOINT ["python3"]
 ---> Using cache
 ---> 41fbb5b0e27c
Successfully built 41fbb5b0e27c
Successfully tagged sagemaker-preprocessing:latest
The push refers to repository [485822383573.dkr.ecr.us-east-1.amazonaws.com/sagemaker-preprocessing]
20e4ef78b000: Preparing
1cd08d11abf2: Preparing
073fe9ab5fca: Preparing
873a3963f49c: Preparing
32682a294d34: Preparing
cd77cebc5d3e: Preparing
c899963fae46: Preparing
353cc9dc1c96: Preparing
c89d0deb3e29: Preparing
73595

https://docs.docker.com/engine/reference/commandline/login/#credentials-store



We are now ready to define the steps in our pipeline. Let's begin by defining the preprocessing step.

In [10]:
from sagemaker.processing import ScriptProcessor
from sagemaker.workflow.steps import ProcessingStep

docker_name = "sagemaker-preprocessing"
account = sagemaker_session.boto_session.client("sts").get_caller_identity()["Account"]
region = sagemaker_session.boto_session.region_name
image = "{}.dkr.ecr.{}.amazonaws.com/{}:latest".format(account, region, docker_name)
print(image)
script_processor = ScriptProcessor(image_uri=image,
                role=role,
                instance_count=1,
                instance_type='ml.m5.xlarge',
                command=['python3'],
                sagemaker_session=pipeline_session)


processor_args=script_processor.run(code='scripts/preprocessing.py',
                    arguments = ["--bucket",bucket,'--region',region])



step_process = ProcessingStep(
    name="PreprocessData",
    step_args=processor_args,
)


485822383573.dkr.ecr.us-east-1.amazonaws.com/sagemaker-preprocessing:latest





Job Name:  sagemaker-preprocessing-2022-08-16-01-24-34-435
Inputs:  [{'InputName': 'code', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-485822383573/sagemaker-preprocessing-2022-08-16-01-24-34-435/input/code/preprocessing.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  []
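
The processing step passes `--bucket` and `--region` to the container, which runs `python3` on the uploaded script. The argument definitions in `preprocessing.py` are cut off in the listing above, so the following is only a plausible sketch of how a script could receive those two flags; the `parse_args` helper, the help strings, and the choice to make both flags required are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the two flags that the processing step passes to the script."""
    parser = argparse.ArgumentParser(description="Preprocessing job arguments (sketch)")
    parser.add_argument("--bucket", required=True, help="S3 bucket for the outputs")
    parser.add_argument("--region", required=True, help="AWS region of the bucket")
    # argv=None makes argparse read sys.argv, as it would inside the container.
    return parser.parse_args(argv)


# Example values only; the pipeline supplies the real bucket and region.
args = parse_args(["--bucket", "my-example-bucket", "--region", "us-east-1"])
print(args.bucket, args.region)
```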


Just like the preprocessing script, we have a training script that we will use to train our model. Let's look at it in more detail.

In [25]:
!pygmentize scripts/train.py

import argparse
import os
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.externals import joblib
from sklearn.preprocessing import Normalizer
from sklearn.feature_extraction.text import TfidfVectorizer
from io import StringIO

def model_fn(model_dir):

    kmeans = joblib.load(os.path.join(model_dir, "

As you can see, the script trains a k-means clustering model with 2 clusters. We will now add a training step to our pipeline.
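
For readers who want to see what k-means does under the hood, here is a minimal pure-Python sketch of Lloyd's algorithm with two clusters. The training script itself relies on scikit-learn's `KMeans`; this version is a teaching aid with hypothetical function names and toy data, and it omits refinements such as k-means++ initialization and multiple restarts.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(points, n_clusters=2, random_state=0, max_iter=100):
    """Cluster points with Lloyd's algorithm; return (labels, centroids)."""
    rng = random.Random(random_state)
    # Initialize centroids from randomly chosen distinct data points.
    centroids = [list(p) for p in rng.sample(points, n_clusters)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(n_clusters), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # Assignments are stable, so the algorithm has converged.
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids


# Toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, n_clusters=2, random_state=0)
print(labels)
print(centroids)
```

The cluster numbers themselves are arbitrary; what matters is which points share a label.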

In [12]:
from sagemaker.workflow.steps import TrainingStep

model_path= f"s3://{bucket}/{prefix}/model/"


sklearn = SKLearn(
    source_dir='scripts',
    entry_point='train.py',
    instance_type="ml.m4.xlarge",
    role = role,
    sagemaker_session=pipeline_session,
    framework_version='0.20.0',
    output_path=model_path,
    hyperparameters={'n_clusters': 2, 'random_state':0})

train_args=sklearn.fit({'training': 's3://{}/{}/prediction_data.csv'.format(bucket,prefix)})
step_train_model = TrainingStep(name="TrainModel", step_args=train_args)
step_train_model.add_depends_on([step_process])
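
In SageMaker script mode, the entries in `hyperparameters` reach the entry point as command-line flags, and the container exposes locations such as the model directory and the `training` channel through environment variables (`SM_MODEL_DIR`, `SM_CHANNEL_TRAINING`). The sketch below shows one common way a training script might read them. It is not the book's `train.py`: the `--model-dir` and `--train` flags, the helper name, and the fallback paths are assumptions for illustration.

```python
import argparse
import os


def parse_training_args(argv=None):
    """Read hyperparameters and data locations for a script-mode training job."""
    parser = argparse.ArgumentParser()
    # Hyperparameters from the estimator arrive as command-line flags.
    parser.add_argument("--n_clusters", type=int, default=2)
    parser.add_argument("--random_state", type=int, default=0)
    # SageMaker sets these environment variables inside the training container;
    # the fallbacks are the conventional container paths.
    parser.add_argument(
        "--model-dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
    )
    parser.add_argument(
        "--train",
        default=os.environ.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training"),
    )
    return parser.parse_args(argv)


# Simulate the flags that the hyperparameters above would produce.
args = parse_training_args(["--n_clusters", "2", "--random_state", "0"])
print(args.n_clusters, args.random_state, args.model_dir, args.train)
```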

We are now ready to register our model in the SageMaker Model Registry. This is done in the code block below.

In [13]:
from sagemaker.model import Model
from sagemaker.sklearn.model import SKLearnModel
from sagemaker.workflow.model_step import ModelStep



clustering_model = SKLearnModel(
    model_data=step_train_model.properties.ModelArtifacts.S3ModelArtifacts,
    role=role,
    sagemaker_session=pipeline_session,
    entry_point="scripts/train.py",
    framework_version='0.20.0',
    
)



register_model_step_args = clustering_model.register(
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.t2.medium"],
    model_package_group_name='adverse-event-clustering'
)

step_register=ModelStep(name='adverse-event-clustering-model', step_args=register_model_step_args)

Next, we add the three sequential steps into a pipeline and look at the pipeline definition.  

In [14]:
from sagemaker.workflow.pipeline import Pipeline

pipeline = Pipeline(
    name="adverse-drug-reaction",
    steps=[step_process, step_train_model, step_register]
)

In [15]:
import json
definition = json.loads(pipeline.definition())
definition

{'Version': '2020-12-01',
 'Metadata': {},
 'Parameters': [],
 'PipelineExperimentConfig': {'ExperimentName': {'Get': 'Execution.PipelineName'},
  'TrialName': {'Get': 'Execution.PipelineExecutionId'}},
 'Steps': [{'Name': 'PreprocessData',
   'Type': 'Processing',
   'Arguments': {'ProcessingResources': {'ClusterConfig': {'InstanceType': 'ml.m5.xlarge',
      'InstanceCount': 1,
      'VolumeSizeInGB': 30}},
    'AppSpecification': {'ImageUri': '485822383573.dkr.ecr.us-east-1.amazonaws.com/sagemaker-preprocessing:latest',
     'ContainerArguments': ['--bucket',
      'sagemaker-us-east-1-485822383573',
      '--region',
      'us-east-1'],
     'ContainerEntrypoint': ['python3',
      '/opt/ml/processing/input/code/preprocessing.py']},
    'RoleArn': 'arn:aws:iam::485822383573:role/service-role/AmazonSageMaker-ExecutionRole-20220426T122295',
    'ProcessingInputs': [{'InputName': 'code',
      'AppManaged': False,
      'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-485822383573/sagem
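
The full definition is long, so it can be easier to read a summary of it. The helper below lists each step's name, type, and dependencies from a definition dictionary. The `sample` document is a hand-written, heavily abbreviated stand-in rather than real output, and the `DependsOn` key on the training step is an assumption about how the explicit dependency appears; in the notebook you would pass the `definition` variable instead.

```python
import json


def summarize_steps(definition):
    """Return (name, type, depends_on) for each step in a pipeline definition."""
    return [
        (step["Name"], step["Type"], step.get("DependsOn", []))
        for step in definition.get("Steps", [])
    ]


# Hand-written stand-in for json.loads(pipeline.definition()); not real output.
sample = json.loads("""
{
  "Version": "2020-12-01",
  "Steps": [
    {"Name": "PreprocessData", "Type": "Processing", "Arguments": {}},
    {"Name": "TrainModel", "Type": "Training", "Arguments": {},
     "DependsOn": ["PreprocessData"]}
  ]
}
""")

for name, step_type, depends_on in summarize_steps(sample):
    print(name, step_type, depends_on)
```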

We are now ready to start our pipeline execution. The next few cells create or update the pipeline, start an execution, and check its status.

In [16]:
pipeline.upsert(role_arn=role)

{'PipelineArn': 'arn:aws:sagemaker:us-east-1:485822383573:pipeline/adverse-drug-reaction',
 'ResponseMetadata': {'RequestId': '21caa611-b96e-4e37-af34-b6c47e70e945',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '21caa611-b96e-4e37-af34-b6c47e70e945',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '89',
   'date': 'Tue, 16 Aug 2022 01:25:25 GMT'},
  'RetryAttempts': 0}}

In [17]:
execution = pipeline.start()

In [18]:
execution.describe()

{'PipelineArn': 'arn:aws:sagemaker:us-east-1:485822383573:pipeline/adverse-drug-reaction',
 'PipelineExecutionArn': 'arn:aws:sagemaker:us-east-1:485822383573:pipeline/adverse-drug-reaction/execution/7hzfj8g9p4o7',
 'PipelineExecutionDisplayName': 'execution-1660613127172',
 'PipelineExecutionStatus': 'Executing',
 'PipelineExperimentConfig': {'ExperimentName': 'adverse-drug-reaction',
  'TrialName': '7hzfj8g9p4o7'},
 'CreationTime': datetime.datetime(2022, 8, 16, 1, 25, 27, 94000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2022, 8, 16, 1, 25, 27, 94000, tzinfo=tzlocal()),
 'CreatedBy': {},
 'LastModifiedBy': {},
 'ResponseMetadata': {'RequestId': '4451c759-bd77-48fc-be65-cf2a383ec9d7',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '4451c759-bd77-48fc-be65-cf2a383ec9d7',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '504',
   'date': 'Tue, 16 Aug 2022 01:25:28 GMT'},
  'RetryAttempts': 0}}

In [19]:
execution.wait()
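
Waiting for an execution amounts to polling its status until it reaches a terminal state. The sketch below shows that idea with a stand-in `describe` callable so that it runs without AWS access. The function name, the default delay and attempt count, and the fake status sequence are illustrative assumptions, not the SDK's implementation; with a real execution you could pass `execution.describe` instead.

```python
import time

# Statuses after which a pipeline execution no longer changes.
TERMINAL_STATUSES = {"Succeeded", "Failed", "Stopped"}


def wait_for_completion(describe, delay=30, max_attempts=60, sleep=time.sleep):
    """Poll describe() until the execution status is terminal, then return it."""
    for attempt in range(max_attempts):
        status = describe()["PipelineExecutionStatus"]
        if status in TERMINAL_STATUSES:
            return status
        if attempt < max_attempts - 1:
            sleep(delay)  # Pause between polls, but not after the last one.
    raise TimeoutError(f"Execution still running after {max_attempts} checks")


# Fake describe() that reports two in-progress polls and then success.
statuses = iter(["Executing", "Executing", "Succeeded"])


def fake_describe():
    return {"PipelineExecutionStatus": next(statuses)}


print(wait_for_completion(fake_describe, delay=0))
```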

The pipeline is now running. At the end of this run, you will have a model registered in the SageMaker Model Registry. Leave the notebook running and return to chapter 9 in the book for instructions on completing the remaining steps.