# Random predictions to SageMaker endpoint

**Note:** Please set kernel to `Python 3 (Data Science)`

---

🔥 Before running this notebook, ensure that notebook 03 ran successfully and the updated files were committed to CodeCommit. Also, ensure the SageMaker Pipeline execution is successful and the staging endpoint shows `InService`. You can check the endpoint status in the SageMaker project page, under the `Endpoints` tab.

Note: The pipeline takes about 12 minutes to run, and deploying the SageMaker endpoint takes about 5 minutes. 🔥

In the screenshot below, the project name is `BankDM` and the endpoint is `BankDM-staging`.

![endpoint](img/check-endpoint1.png)
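
The same status check can also be done from code. The following is a minimal sketch, assuming a boto3 SageMaker client and the endpoint name `BankDM-staging` from the screenshot. The helper name `endpoint_in_service` is hypothetical and not part of the project.

```python
def endpoint_in_service(sagemaker_client, endpoint_name):
    """Return True if the endpoint reports the status 'InService'.

    `sagemaker_client` is expected to behave like boto3.client('sagemaker').
    """
    response = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    return response['EndpointStatus'] == 'InService'


# Example usage in this notebook (requires AWS credentials):
# import boto3
# print(endpoint_in_service(boto3.client('sagemaker'), 'BankDM-staging'))
```

The client is passed in as an argument so that the helper can be tried out with a stub object before pointing it at a real account.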

---

### MLOps Workflow

To recap, the following diagram shows the MLOps workflow that ran earlier. The workflow starts on the left side when you commit your code. At this point, the staging SageMaker endpoint has been deployed, but the production endpoint will not be deployed until you manually approve the deployment in CodePipeline.

![pipeline](img/pipeline1.png)

The diagrams below describe the workflow in more detail:

![pipeline](img/pipeline2.png)

![pipeline](img/pipeline3.png)

### High-level architecture diagram
The diagram below shows the final architecture.

![diagram](img/diagram2.png)

---

## Introduction

Now that the model is trained and deployed, the next step is to use it to make predictions. You will first connect to Redshift and retrieve the data. For simplicity, this demo reuses data that is already stored in Redshift, which is the same data used for training and testing. In a real-world scenario, the model should not have seen the prediction data before. Before making predictions, you need to preprocess the data so that it has the same format as the training data.
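
To illustrate what "the same format" means for a single record, the sketch below puts a record's features into the training column order and fills absent indicator columns with 0. It is a simplified, standard-library-only illustration. The helper `align_features` and the short column list are hypothetical, and the notebook itself relies on pandas for preprocessing.

```python
def align_features(record, expected_columns):
    """Return the record's values in training-column order.

    Columns missing from `record` (for example a one-hot category that
    does not occur in this record) are filled with 0. Unexpected keys
    raise an error so that schema drift is noticed early.
    """
    unexpected = set(record) - set(expected_columns)
    if unexpected:
        raise ValueError(f"Unexpected feature(s): {sorted(unexpected)}")
    return [record.get(column, 0) for column in expected_columns]


# Hypothetical subset of the training columns, for illustration only
expected_columns = ['age', 'campaign', 'job_admin.', 'job_student']
record = {'job_student': 1, 'age': 29, 'campaign': 2}

row = align_features(record, expected_columns)
print(row)
```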

---

### Variables
Names of the secret in Secrets Manager and of the SageMaker endpoint. The Redshift, Athena, and Glue connection details are stored in the secret.

In [1]:
secret_name='bankdm_redshift_login' 

# The endpoint name below is the default based on the SageMaker project name
# endpoint name is case sensitive
endpoint_name = 'BankDM-staging'

### Install and import libraries

In [2]:
!pip install -q SQLAlchemy==1.3.13
!pip install psycopg2-binary pyathena
!pip install -U pip
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from pyathena import connect
from botocore.exceptions import ClientError
import numpy as np
import pandas as pd
import time
import json
import boto3
import sagemaker
from sagemaker.serializers import CSVSerializer
import random
import math


### Create client session


In [3]:
# Get region 
session = boto3.session.Session()
region_name = session.region_name

# Get SageMaker session & default S3 bucket
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

s3 = boto3.client('s3')
redshift = boto3.client('redshift')
secretsmanager = boto3.client('secretsmanager')

region = region_name  # alias; reuses the session created at the top of this cell

### Get credentials & connection information from Secrets Manager

In [4]:
import base64  # needed to decode binary secrets

try:
    get_secret_value_response = secretsmanager.get_secret_value(
        SecretId=secret_name
    )
    secret_arn = get_secret_value_response['ARN']

except ClientError as e:
    print("Error retrieving secret. Error: " + e.response['Error']['Message'])
    raise  # stop here; the code below cannot run without the secret

else:
    # Depending on whether the secret is a string or binary, one of these fields will be populated.
    if 'SecretString' in get_secret_value_response:
        secret = get_secret_value_response['SecretString']
    else:
        secret = base64.b64decode(get_secret_value_response['SecretBinary'])

secret_json = json.loads(secret)
master_user_name = secret_json['username']
master_user_pw = secret_json['password']
redshift_port = secret_json['port']
redshift_cluster_identifier = secret_json['dbClusterIdentifier']
redshift_endpoint_address = secret_json['host']

database_name_redshift = secret_json['database_name_redshift']
database_name_glue = secret_json['database_name_glue']

schema_redshift = secret_json['schema_redshift']
schema_athena = secret_json['schema_athena']

table_name_glue = secret_json['table_name_glue']
table_name_redshift = secret_json['table_name_redshift']

# print(master_user_name)
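
The lookups above raise a bare `KeyError` on the first field that the secret lacks. As a hedged sketch, a small helper can report all missing keys at once. The function name `parse_connection_secret` is hypothetical, and the required keys are simply the ones this notebook reads.

```python
import json

# Keys that this notebook reads from the secret
REQUIRED_KEYS = (
    'username', 'password', 'port', 'dbClusterIdentifier', 'host',
    'database_name_redshift', 'database_name_glue',
    'schema_redshift', 'schema_athena',
    'table_name_glue', 'table_name_redshift',
)


def parse_connection_secret(secret_string, required_keys=REQUIRED_KEYS):
    """Parse the secret JSON and fail with a clear message if keys are missing."""
    secret_json = json.loads(secret_string)
    missing = [key for key in required_keys if key not in secret_json]
    if missing:
        raise KeyError(f"Secret is missing keys: {', '.join(missing)}")
    return secret_json


# Illustration with made-up values (not real credentials)
example = json.dumps({key: 'placeholder' for key in REQUIRED_KEYS})
parsed = parse_connection_secret(example)
print(sorted(parsed) == sorted(REQUIRED_KEYS))
```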

## Redshift

### Connect to Redshift

In [5]:
response = redshift.describe_clusters(ClusterIdentifier=redshift_cluster_identifier)
iam_role = response['Clusters'][0]['IamRoles'][0]['IamRoleArn']

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(master_user_name, master_user_pw, redshift_endpoint_address, redshift_port, database_name_redshift))
session = sessionmaker()
session.configure(bind=engine)
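
One caveat with building the connection string through `str.format`: a password containing characters such as `@`, `/`, or `:` would break URL parsing. The sketch below shows one way to guard against that using only the standard library. The helper name `build_redshift_url` and the sample values are hypothetical.

```python
from urllib.parse import quote_plus


def build_redshift_url(user, password, host, port, database):
    """Build a PostgreSQL-style connection URL with escaped credentials."""
    return 'postgresql://{}:{}@{}:{}/{}'.format(
        quote_plus(user), quote_plus(password), host, port, database
    )


# Made-up values for illustration only
url = build_redshift_url('admin', 'p@ss/word', 'example-host', 5439, 'dev')
print(url)
```

The resulting string could be passed to `create_engine` in place of the formatted string used above.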


### Get data from Redshift and shuffle it
As mentioned earlier, for simplicity, the data used for prediction is the same data that was used for training and testing.

In [6]:
statement = f"SELECT * FROM {schema_redshift}.{table_name_redshift} ORDER BY RANDOM()"

# print(statement)
data = pd.read_sql_query(statement, engine)
data.head()

| | age | job | marital | education | defaulted | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp_var_rate | cons_price_idx | cons_conf_idx | euribor3m | nr_employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29 | services | single | high.school | no | yes | no | cellular | may | thu | ... | 2 | 999 | 1 | failure | -2.0 | 93.0 | -46.0 | 1.0 | | no |
| 1 | 30 | technician | single | professional.course | no | no | no | cellular | aug | tue | ... | 3 | 999 | 0 | nonexistent | 1.0 | 93.0 | -36.0 | 5.0 | | no |
| 2 | 33 | technician | single | university.degree | unknown | unknown | unknown | cellular | aug | wed | ... | 1 | 999 | 0 | nonexistent | 1.0 | 93.0 | -36.0 | 5.0 | | no |
| 3 | 36 | admin. | married | university.degree | no | yes | yes | telephone | jun | fri | ... | 2 | 999 | 0 | nonexistent | 1.0 | 94.0 | -42.0 | 5.0 | | no |
| 4 | 30 | blue-collar | married | basic.6y | no | yes | no | telephone | may | mon | ... | 3 | 999 | 0 | nonexistent | 1.0 | 94.0 | -36.0 | 5.0 | | no |


### Pre-processing of data

In [7]:
data['no_previous_contact'] = np.where(data['pdays'] == 999, 1, 0)                                 # Indicator variable to capture when pdays takes a value of 999
data['not_working'] = np.where(np.isin(data['job'], ['student', 'retired', 'unemployed']), 1, 0)   # Indicator for individuals not actively employed
model_data = pd.get_dummies(data)                                                                  # Convert categorical variables to sets of indicators

# Drop columns that are excluded from the model input
model_data = model_data.drop(['duration', 'emp_var_rate', 'cons_price_idx', 'cons_conf_idx', 'euribor3m'], axis=1)

# Keep y_yes as the single target column (0 is no, 1 is yes), move it to the front, and drop the redundant y_no
df = pd.concat([model_data['y_yes'], model_data.drop(['y_no', 'y_yes'], axis=1)], axis=1)
df.columns

Index(['y_yes', 'age', 'campaign', 'pdays', 'previous', 'no_previous_contact',
       'not_working', 'job_admin.', 'job_blue-collar', 'job_entrepreneur',
       'job_housemaid', 'job_management', 'job_retired', 'job_self-employed',
       'job_services', 'job_student', 'job_technician', 'job_unemployed',
       'job_unknown', 'marital_divorced', 'marital_married', 'marital_single',
       'marital_unknown', 'education_basic.4y', 'education_basic.6y',
       'education_basic.9y', 'education_high.school', 'education_illiterate',
       'education_professional.course', 'education_university.degree',
       'education_unknown', 'defaulted_no', 'defaulted_unknown',
       'defaulted_yes', 'housing_no', 'housing_unknown', 'housing_yes',
       'loan_no', 'loan_unknown', 'loan_yes', 'contact_cellular',
       'contact_telephone', 'month_apr', 'month_aug', 'month_dec', 'month_jul',
       'month_jun', 'month_mar', 'month_may', 'month_nov', 'month_oct',
       'month_sep', 'day_of_week_fri', 'd

In [8]:
# Extract the target column into a separate Series and drop it from the feature dataframe.
df_y_yes = df['y_yes']
df = df.drop(['y_yes'], axis=1)

# Confirm that y_yes is no longer among the columns
df.columns

Index(['age', 'campaign', 'pdays', 'previous', 'no_previous_contact',
       'not_working', 'job_admin.', 'job_blue-collar', 'job_entrepreneur',
       'job_housemaid', 'job_management', 'job_retired', 'job_self-employed',
       'job_services', 'job_student', 'job_technician', 'job_unemployed',
       'job_unknown', 'marital_divorced', 'marital_married', 'marital_single',
       'marital_unknown', 'education_basic.4y', 'education_basic.6y',
       'education_basic.9y', 'education_high.school', 'education_illiterate',
       'education_professional.course', 'education_university.degree',
       'education_unknown', 'defaulted_no', 'defaulted_unknown',
       'defaulted_yes', 'housing_no', 'housing_unknown', 'housing_yes',
       'loan_no', 'loan_unknown', 'loan_yes', 'contact_cellular',
       'contact_telephone', 'month_apr', 'month_aug', 'month_dec', 'month_jul',
       'month_jun', 'month_mar', 'month_may', 'month_nov', 'month_oct',
       'month_sep', 'day_of_week_fri', 'day_of_wee

In [9]:
# Inspect the first rows of the feature dataframe
df.head()

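| | age | campaign | pdays | previous | no_previous_contact | not_working | job_admin. | job_blue-collar | job_entrepreneur | job_housemaid | ... | month_oct | month_sep | day_of_week_fri | day_of_week_mon | day_of_week_thu | day_of_week_tue | day_of_week_wed | poutcome_failure | poutcome_nonexistent | poutcome_success |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29 | 2 | 999 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | 30 | 3 | 999 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 33 | 1 | 999 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 3 | 36 | 2 | 999 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 30 | 3 | 999 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |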


In [10]:
# Check the values of the dataframe. Later on, one row will be randomly selected for prediction.
df.values

array([[ 29,   2, 999, ...,   1,   0,   0],
       [ 30,   3, 999, ...,   0,   1,   0],
       [ 33,   1, 999, ...,   0,   1,   0],
       ...,
       [ 52,   2, 999, ...,   0,   1,   0],
       [ 38,   1, 999, ...,   0,   1,   0],
       [ 32,   1, 999, ...,   0,   1,   0]])

In [11]:
# Check that the target values are in the expected format: 0 is no, 1 is yes
df_y_yes.head()

0    0
1    0
2    0
3    0
4    0
Name: y_yes, dtype: uint8

In [12]:
# Create a predictor for the endpoint that sends rows as CSV
predictor = sagemaker.predictor.Predictor(endpoint_name=endpoint_name)
predictor.serializer = CSVSerializer()

In [13]:
# Randomly select a user and predict the result
i = random.randrange(len(df))  # valid row positions are 0 .. len(df) - 1
print('user id chosen: ' + str(i))
pred = predictor.predict(df.values[i])
pred

user id chosen: 10714


b'0.04751679301261902'

In [14]:
# Check the actual result
df_y_yes[i]

0

In [15]:
# Check whether the predicted probability is within 0.1 of the actual label
math.isclose(float(pred), df_y_yes[i], abs_tol=0.1)

True
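
The tolerance check above only returns `True` when the predicted probability lies within 0.1 of the label, so a correct but less confident prediction, such as 0.3 for a `0` label, would be reported as `False`. An alternative sketch converts the response to a class label before comparing. It assumes the endpoint returns a single probability as bytes (as in the output above) and a decision threshold of 0.5, which is an assumption and not a value taken from this project. The helper `to_label` is hypothetical.

```python
def to_label(raw_prediction, threshold=0.5):
    """Convert the endpoint's raw response (bytes or str) to a 0/1 label."""
    if isinstance(raw_prediction, bytes):
        raw_prediction = raw_prediction.decode('utf-8')
    probability = float(raw_prediction)
    return 1 if probability >= threshold else 0


# Using the response shown above
label = to_label(b'0.04751679301261902')
print(label)
```

In the notebook, the result could then be compared with the actual value, for example `to_label(pred) == df_y_yes[i]`.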

---

## Next steps

Now that prediction through a SageMaker endpoint works, the next step is to make predictions with Redshift ML as well.