# Predict breast cancer type using SageMaker linear-learner

Contents
- [0. Setup](#setup)
- [1. Introduction](#introduction)
- [2. Load and explore data](#load)
- [3. Create logistic regression model using SageMaker linear-learner algorithm](#model)
- [4. Deploy the model in the AWS Cloud](#deployment)
- [5. Score the model](#score)

**Note:** This notebook works correctly with kernel `Python 3.5+`.

<a id="setup"></a>
## 0. Setup

Before you use the sample code in this notebook, you must perform the following setup tasks:

- Create a SageMaker service. The setup steps are described here: https://docs.aws.amazon.com/sagemaker/latest/dg/gs-set-up.html
- Install the required Python packages from the PyPI repository

### Package installation

In [None]:
!pip install boto3 | tail -n 1
!pip install sagemaker | tail -n 1
!pip install pandas | tail -n 1
!pip install scikit-learn | tail -n 1

<a id="introduction"></a>
## 1. Introduction

This notebook defines, trains, and deploys a model that predicts whether a breast tumor is malignant or benign.

<a id="load"></a>
## 2. Load and explore data

In this section you will load the data into a pandas DataFrame and perform a basic exploration. You will then upload the training data to Amazon S3 object storage.

### 2.1 Load data from webpage

In [None]:
import pandas as pd
from sklearn.utils import shuffle

In [None]:
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header = None)

data.columns = ["id","diagnosis","radius_mean","texture_mean","perimeter_mean","area_mean","smoothness_mean",
                "compactness_mean","concavity_mean","concave points_mean","symmetry_mean","fractal_dimension_mean",
                "radius_se","texture_se","perimeter_se","area_se","smoothness_se","compactness_se","concavity_se",
                "concave points_se","symmetry_se","fractal_dimension_se","radius_worst","texture_worst",
                "perimeter_worst","area_worst","smoothness_worst","compactness_worst","concavity_worst",
                "concave points_worst","symmetry_worst","fractal_dimension_worst"] 

### 2.2 Explore data

In [None]:
print("Sample records:")
display(data.head())

print("Features columns summary:")
display(data.iloc[:, 2:].describe())

print("Label column summary:")
display(data.diagnosis.value_counts())

### 2.3 Store training data in S3 Object Storage

You will use the built-in SageMaker linear-learner algorithm. When the training data is in `text/csv` format, this algorithm expects the first column to be the label.

Moreover, the label column has to be numeric, so the diagnosis values `M` (malignant) and `B` (benign) are encoded as 1 and 0 below.

#### Save prepared data to local filesystem

In [None]:
data_shuffled = shuffle(data)
data_shuffled.replace({'diagnosis': {'M': 1, 'B': 0}}, inplace = True)
display(data_shuffled.head())

In [None]:
train_data_filename = 'breast_cancer.csv'
data_shuffled.iloc[:, 1:].to_csv(path_or_buf = train_data_filename, index = False, header = False)

**Note:** The header row has to be omitted.
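
The format requirements above (label in the first column, numeric values only, no header row) can be checked before the file is uploaded. The following is a minimal sketch and not part of the original notebook; the helper name `check_training_csv` and the small in-memory sample are illustrative assumptions.

```python
import csv
import io


def check_training_csv(lines, feature_dim):
    """Return a list of (row_number, problem) tuples for label-first CSV rows.

    Assumes the layout described above: no header row, a numeric label in
    the first column, followed by `feature_dim` numeric feature columns.
    """
    problems = []
    for row_number, row in enumerate(csv.reader(lines), start=1):
        if len(row) != feature_dim + 1:
            problems.append((row_number, 'expected {} columns, got {}'.format(
                feature_dim + 1, len(row))))
            continue
        try:
            # Every cell, including the label, must parse as a number.
            [float(value) for value in row]
        except ValueError:
            problems.append((row_number, 'non-numeric value'))
    return problems


# Tiny in-memory example with two features per record (hypothetical values).
sample = io.StringIO('1,17.99,10.38\n0,13.54,14.36\nM,20.57,17.77\n')
print(check_training_csv(sample, feature_dim=2))
```

For the file written above, the equivalent call would be `check_training_csv(open(train_data_filename), feature_dim=30)`; a header row or an unencoded `M`/`B` label would be reported as a non-numeric value.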

#### Upload data to S3 Object Storage

In [None]:
import time
import json
import boto3

In [None]:
aws_credentials = {'access_key': '****', 
                   'secret_key': '****', 
                   'region_name': '****'} # e.g. us-east-2

**Note:** You have to provide credentials from your Amazon account.
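
To keep secrets out of the notebook itself, the same dictionary could be filled from environment variables. This is a hedged sketch rather than the notebook's own approach: `credentials_from_env` is a hypothetical helper, and it assumes the variables `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_DEFAULT_REGION` (the names the AWS SDKs read) have been set beforehand.

```python
import os


def credentials_from_env(env=os.environ):
    """Build the `aws_credentials` dictionary from environment variables.

    The mapping onto the notebook's dictionary keys is an illustrative
    choice; the variable names are the ones the AWS SDKs read themselves.
    """
    mapping = {
        'access_key': 'AWS_ACCESS_KEY_ID',
        'secret_key': 'AWS_SECRET_ACCESS_KEY',
        'region_name': 'AWS_DEFAULT_REGION',
    }
    missing = [name for name in mapping.values() if not env.get(name)]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    return {key: env[name] for key, name in mapping.items()}


# Example with a plain dictionary standing in for os.environ (placeholder values).
example_env = {
    'AWS_ACCESS_KEY_ID': 'EXAMPLE-ACCESS-KEY',
    'AWS_SECRET_ACCESS_KEY': 'EXAMPLE-SECRET-KEY',
    'AWS_DEFAULT_REGION': 'us-east-2',
}
print(credentials_from_env(example_env)['region_name'])
```

In the notebook, `aws_credentials = credentials_from_env()` would then replace the hard-coded dictionary.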

In [None]:
session = boto3.Session(
    aws_access_key_id = aws_credentials['access_key'],
    aws_secret_access_key = aws_credentials['secret_key'],
    region_name = aws_credentials['region_name']
)
s3 = session.resource('s3')

#### 2.4 [Create an S3 bucket](https://s3.console.aws.amazon.com/s3) and use the name in the cell below for `bucket_name`.



In [None]:
bucket_name = 'XXXXXXXXXXXXXX'
train_data_filename = 'breast_cancer.csv'
train_data_path = 'breast-cancer/train'
output_data_path = 's3://{}/breast-cancer/output'.format(bucket_name)
time_suffix = time.strftime("%Y-%m-%d-%H-%M", time.gmtime())

**Note:** You have to replace the value of `bucket_name` with the name of a bucket in your S3 object storage.

You can run the following code to list all your buckets: `[bkt.name for bkt in s3.buckets.all()]`.

In [None]:
s3_bucket = s3.Bucket(bucket_name)
s3_bucket.upload_file(Filename = train_data_filename, Key = '{}/{}'.format(train_data_path, train_data_filename))

Let's check if your data has been uploaded successfully.

In [None]:
for s3_obj in s3_bucket.objects.all():
    if (s3_obj.bucket_name == bucket_name) and (train_data_path in s3_obj.key):
        train_data_uri = 's3://{}/{}'.format(s3_obj.bucket_name, s3_obj.key)
        print(train_data_uri)
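
As a cross-check on the listing above, the expected URI can also be derived directly from the bucket name, the prefix, and the file name chosen at upload time. The helper `s3_uri` below is a hypothetical illustration, and the bucket name is a placeholder.

```python
def s3_uri(bucket, *key_parts):
    """Join a bucket name and key segments into an s3:// URI."""
    # Strip stray slashes so segments are joined with exactly one '/'.
    key = '/'.join(part.strip('/') for part in key_parts if part)
    return 's3://{}/{}'.format(bucket, key)


# Placeholder bucket name; the path and file name mirror the values used above.
expected_uri = s3_uri('my-example-bucket', 'breast-cancer/train', 'breast_cancer.csv')
print(expected_uri)
```

If the value printed by the listing differs from `s3_uri(bucket_name, train_data_path, train_data_filename)`, another object under the same prefix has probably been picked up.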

<a id="model"></a>
## 3. Create logistic regression model using SageMaker linear-learner algorithm

In this section you will learn how to:

- [3.1 Set up training parameters](#prep)
- [3.2 Start training job](#train)

<a id="prep"></a>
### 3.1 Set up training parameters

In [None]:
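# Note: SageMaker Python SDK v2 deprecates this helper in favor of sagemaker.image_uris.retrieve
from sagemaker.amazon.amazon_estimator import get_image_uri

# Low-level SageMaker client used to create the training job, model, and endpoint
sm_client = session.client('sagemaker')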

In [None]:
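# Container image of the built-in linear-learner algorithm for the session's region
training_image = get_image_uri(session.region_name, 'linear-learner')

# Take the first IAM role whose name contains 'AmazonSageMaker-ExecutionRole';
# the unpacking raises ValueError if no such role exists
iam_client = session.client('iam')
[role_arn, *_] = [role['Arn'] for role in iam_client.list_roles()['Roles'] if 'AmazonSageMaker-ExecutionRole' in role['RoleName']]

linear_job_name = 'Breast-cancer-linear-learner-' + time_suffix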

In [None]:
linear_training_params = {
    "AlgorithmSpecification": {
        "TrainingImage": training_image,
        "TrainingInputMode": "File"
    },
    "HyperParameters": {
        "feature_dim": "30",
        "mini_batch_size": "100",
        "predictor_type": "binary_classifier",
        "epochs": "10",
        "num_models": "32",
        "loss": "auto"
    },
    "InputDataConfig": [{
        "ChannelName": "train",
        "ContentType": "text/csv", 
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_data_uri,
                "S3DataDistributionType": "ShardedByS3Key"
            }
        }
    }],
    "OutputDataConfig": {"S3OutputPath": output_data_path},
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.c4.xlarge",
        "VolumeSizeInGB": 2
    },
    "RoleArn": role_arn,
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 6 * 60
    },
    "TrainingJobName": linear_job_name

}
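
Note that every value in `HyperParameters` is passed as a string, and that `feature_dim` must equal the number of feature columns in the training file (30 here, because the file holds the label plus 30 measurements). A small sketch of deriving both from the data follows; `count_features` and `as_hyperparameters` are hypothetical helpers, and the sample rows are made up.

```python
import csv
import io


def count_features(lines):
    """Number of feature columns in a label-first CSV (columns minus the label)."""
    first_row = next(csv.reader(lines))
    return len(first_row) - 1


def as_hyperparameters(values):
    """Convert every value to a string, as the HyperParameters map requires."""
    return {name: str(value) for name, value in values.items()}


# Made-up sample with a label and three features per record.
sample = io.StringIO('1,17.99,10.38,122.8\n0,13.54,14.36,87.46\n')
params = as_hyperparameters({
    'feature_dim': count_features(sample),
    'mini_batch_size': 100,
    'predictor_type': 'binary_classifier',
    'epochs': 10,
})
print(params)
```

With the real training file, `count_features(open(train_data_filename))` should return 30, matching the value used in `linear_training_params`.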

<a id="train"></a>
### 3.2 Start training job

In [None]:
sm_client.create_training_job(**linear_training_params)

In [None]:
try:
    sm_client.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName = linear_job_name)
except Exception:
    print('Training job error.')

In [None]:
train_job_details = sm_client.describe_training_job(TrainingJobName = linear_job_name)
train_job_status = train_job_details['TrainingJobStatus']

if train_job_status == 'Failed':
    print(train_job_details['FailureReason'])
else:
    train_job_arn = train_job_details['TrainingJobArn']
    print(train_job_arn)
    trained_model_uri = train_job_details['ModelArtifacts']['S3ModelArtifacts']
    print(trained_model_uri)
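
The waiter used above blocks silently until the job finishes. When intermediate status output is wanted, the same wait can be written as an explicit polling loop over `describe_training_job`. The sketch below is an assumption-laden illustration: `wait_for_training_job` and `FakeSageMakerClient` are hypothetical names, and the fake client only exists so the example runs without AWS access.

```python
import time

# Statuses after which a training job no longer changes.
TERMINAL_STATUSES = {'Completed', 'Failed', 'Stopped'}


def wait_for_training_job(client, job_name, poll_seconds=30, max_polls=40, sleep=time.sleep):
    """Poll describe_training_job until the job reaches a terminal status."""
    for _ in range(max_polls):
        details = client.describe_training_job(TrainingJobName=job_name)
        status = details['TrainingJobStatus']
        print('Training job status:', status)
        if status in TERMINAL_STATUSES:
            return details
        sleep(poll_seconds)
    raise TimeoutError('Training job did not finish after {} polls'.format(max_polls))


class FakeSageMakerClient:
    """Stand-in for the real client that replays a fixed list of statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_training_job(self, TrainingJobName):
        return {'TrainingJobName': TrainingJobName,
                'TrainingJobStatus': next(self._statuses)}


fake_client = FakeSageMakerClient(['InProgress', 'InProgress', 'Completed'])
result = wait_for_training_job(fake_client, 'example-job', sleep=lambda seconds: None)
print(result['TrainingJobStatus'])
```

In the notebook, the call would be `wait_for_training_job(sm_client, linear_job_name)`, and the returned dictionary can be used in place of `train_job_details`.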

<a id="deployment"></a>
## 4. Deploy the model in the AWS Cloud

### 4.1 Set up deployment parameters

In [None]:
linear_hosting_container = {'Image': training_image, 'ModelDataUrl': trained_model_uri}

create_model_details = sm_client.create_model(
    ModelName = linear_job_name,
    ExecutionRoleArn = role_arn,
    PrimaryContainer = linear_hosting_container)

print(create_model_details['ModelArn'])

### 4.2 Create endpoint configuration

In [None]:
endpoint_config = 'Breast-cancer-linear-endpoint-config-' + time_suffix
print(endpoint_config)

create_endpoint_config_details = sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config,
    ProductionVariants = [{
        'InstanceType': 'ml.m4.xlarge',
        'InitialInstanceCount': 1,
        'ModelName': linear_job_name,
        'VariantName': 'AllTraffic'}])

In [None]:
endpoint_config_details = sm_client.describe_endpoint_config(EndpointConfigName = endpoint_config)
print(endpoint_config_details)

### 4.3 Create scoring endpoint

In [None]:
scoring_endpoint = 'Breast-cancer-endpoint-scoring-' + time_suffix

create_endpoint_details = sm_client.create_endpoint(
    EndpointName = scoring_endpoint,
    EndpointConfigName = endpoint_config)

In [None]:
try:
    sm_client.get_waiter('endpoint_in_service').wait(EndpointName = scoring_endpoint)
except Exception:
    print('Create scoring endpoint error')

In [None]:
scoring_endpoint_details = sm_client.describe_endpoint(EndpointName = scoring_endpoint)
scoring_endpoint_status = scoring_endpoint_details['EndpointStatus']
if scoring_endpoint_status != 'InService':
    # FailureReason is only present when the endpoint has failed
    print(scoring_endpoint_status, scoring_endpoint_details.get('FailureReason', ''))
else:
    print(scoring_endpoint_details['EndpointArn'])

<a id="score"></a>
## 5. Score the model

### 5.1 Prepare sample data for scoring

You will use data in `csv` format as the scoring payload. The first column (label) is removed from the data and the last 20 training records are selected.

In [None]:
scoring_data_filename = 'scoring_breast_cancer.csv'

In [None]:
with open(train_data_filename) as f_train:
    with open(scoring_data_filename, 'w') as f_score:
        f_score.writelines([','.join(line.split(',')[1:]) for line in f_train.readlines()[-20:]])

### 5.2 Send data for scoring

In [None]:
# 'sagemaker-runtime' is the boto3 service name of the SageMaker inference API
sm_runtime = session.client('sagemaker-runtime')

with open(scoring_data_filename) as f_payload:
    scoring_response = sm_runtime.invoke_endpoint(EndpointName = scoring_endpoint,
                                                  ContentType = 'text/csv',
                                                  Body = f_payload.read().encode())
    
    scored_records = scoring_response['Body'].read().decode()
    print(json.loads(scored_records))
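
The printed response can be turned into a list of labels and compared with the known diagnoses of the scored records. The sketch below assumes the JSON shape that linear-learner returns for a binary classifier, a `predictions` list whose items carry `score` and `predicted_label`; the helpers `predicted_labels` and `accuracy` and the example body are hypothetical and hand-written, not real endpoint output.

```python
import json


def predicted_labels(response_body):
    """Extract integer labels from a linear-learner JSON response string."""
    predictions = json.loads(response_body)['predictions']
    # int() also covers labels that arrive as floats such as 1.0.
    return [int(item['predicted_label']) for item in predictions]


def accuracy(true_labels, labels):
    """Fraction of positions where the two label sequences agree."""
    matches = sum(1 for truth, guess in zip(true_labels, labels) if truth == guess)
    return matches / len(true_labels)


# Hand-written response in the assumed shape (not real endpoint output).
example_body = json.dumps({'predictions': [
    {'score': 0.97, 'predicted_label': 1},
    {'score': 0.02, 'predicted_label': 0},
    {'score': 0.61, 'predicted_label': 1},
    {'score': 0.10, 'predicted_label': 0},
]})
labels = predicted_labels(example_body)
print(labels, accuracy([1, 0, 0, 0], labels))
```

In the notebook, `predicted_labels(scored_records)` would be compared with the first column of the last 20 lines of `train_data_filename`. Because those records were also used for training, the resulting figure is not a held-out estimate of model quality.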

### Authors

Wojciech Sobala, Data Scientist at IBM