## SageMaker Tutorial Series

### Tutorial 1 - Mobile Price Classification Using a Custom SKLearn Script in SageMaker

Data Source - https://www.kaggle.com/datasets/iabhishekofficial/mobile-price-classification?resource=download

### Let's divide the workload
1. Initialize Boto3 SDK and create S3 bucket.
2. Upload data to SageMaker local storage.
3. Data Exploration and Understanding.
4. Split the data into Train/Test CSV Files.
5. Upload data into the S3 Bucket.
6. Create Training Script.
7. Run the training script inside a SageMaker container.
8. Store Model Artifacts (model.tar.gz) in the S3 Bucket.
9. Deploy a SageMaker Endpoint (API) for the trained model, and test it.

In [2]:
import sklearn # Check Sklearn version
sklearn.__version__

'1.3.2'

## 1. Initialize Boto3 SDK and create S3 bucket. 

In [3]:
import numpy as np
from sagemaker import get_execution_role
import sagemaker
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
import datetime
import time
import tarfile
import boto3
import pandas as pd

sm_boto3 = boto3.client("sagemaker")
sess = sagemaker.Session()
region = sess.boto_session.region_name
bucket = 'sagemaker-tutorials-mlhub-039229394722' # Replace with the name of your own S3 bucket
print("Using bucket " + bucket)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml
Using bucket sagemaker-tutorials-mlhub-039229394722


In [4]:
get_execution_role()

'arn:aws:iam::039229394722:role/service-role/AmazonSageMakerServiceCatalogProductsUseRole'
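
Step 1 mentions creating an S3 bucket, but the cells above only reference an existing one. A minimal sketch of how the bucket could be created with Boto3 follows. The helper names `bucket_create_kwargs` and `ensure_bucket` are hypothetical, and the sketch assumes the caller is allowed to create buckets. In `us-east-1` the `CreateBucketConfiguration` argument has to be omitted; in other regions it carries the region as `LocationConstraint`.

```python
def bucket_create_kwargs(bucket_name, region):
    """Build the keyword arguments for the S3 create_bucket call."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be passed explicitly.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def ensure_bucket(bucket_name, region):
    """Create the bucket if head_bucket cannot find it (hypothetical helper)."""
    # Imported lazily so bucket_create_kwargs can be used without AWS libraries.
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3", region_name=region)
    try:
        s3.head_bucket(Bucket=bucket_name)
    except ClientError:
        # Also reached when the name is taken by another account; in that
        # case create_bucket raises and a different name is needed.
        s3.create_bucket(**bucket_create_kwargs(bucket_name, region))


print(bucket_create_kwargs("my-example-bucket", "eu-west-1"))
```

Calling `ensure_bucket(bucket, region)` with the `bucket` and `region` variables from the cell above would be the intended use.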

## 3. Data Exploration and Understanding.

In [5]:
df = pd.read_csv("s3://sagemaker-tutorials-mlhub-039229394722/mob_price_classification_train.csv")


In [6]:
df.shape

(2000, 21)

In [9]:
# price_range is the target and has four classes (0-3); check the class balance
df['price_range'].value_counts(normalize=True)

price_range
1    0.25
2    0.25
3    0.25
0    0.25
Name: proportion, dtype: float64

In [10]:
df.columns

Index(['battery_power', 'blue', 'clock_speed', 'dual_sim', 'fc', 'four_g',
       'int_memory', 'm_dep', 'mobile_wt', 'n_cores', 'pc', 'px_height',
       'px_width', 'ram', 'sc_h', 'sc_w', 'talk_time', 'three_g',
       'touch_screen', 'wifi', 'price_range'],
      dtype='object')

In [12]:
# Percentage of missing values per column
df.isnull().mean() * 100

battery_power    0.0
blue             0.0
clock_speed      0.0
dual_sim         0.0
fc               0.0
four_g           0.0
int_memory       0.0
m_dep            0.0
mobile_wt        0.0
n_cores          0.0
pc               0.0
px_height        0.0
px_width         0.0
ram              0.0
sc_h             0.0
sc_w             0.0
talk_time        0.0
three_g          0.0
touch_screen     0.0
wifi             0.0
price_range      0.0
dtype: float64

In [13]:
features = list(df.columns)
features

['battery_power',
 'blue',
 'clock_speed',
 'dual_sim',
 'fc',
 'four_g',
 'int_memory',
 'm_dep',
 'mobile_wt',
 'n_cores',
 'pc',
 'px_height',
 'px_width',
 'ram',
 'sc_h',
 'sc_w',
 'talk_time',
 'three_g',
 'touch_screen',
 'wifi',
 'price_range']

In [14]:
label = features.pop(-1)
label

'price_range'

In [15]:
x = df[features]
y = df[label]

In [16]:
x.head()

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,pc,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi
0,842,0,2.2,0,1,0,7,0.6,188,2,2,20,756,2549,9,7,19,0,0,1
1,1021,1,0.5,1,0,1,53,0.7,136,3,6,905,1988,2631,17,3,7,1,1,0
2,563,1,0.5,1,2,1,41,0.9,145,5,6,1263,1716,2603,11,2,9,1,1,0
3,615,1,2.5,0,0,0,10,0.8,131,6,9,1216,1786,2769,16,8,11,1,0,0
4,1821,1,1.2,0,13,1,44,0.6,141,2,14,1208,1212,1411,8,2,15,1,1,0


In [16]:
# Target values are the price_range classes 0-3
y.head()

0    1
1    2
2    2
3    2
4    1
Name: price_range, dtype: int64

In [17]:
x.shape

(2000, 20)

In [18]:
y.value_counts()

price_range
1    500
2    500
3    500
0    500
Name: count, dtype: int64

In [19]:
X_train, X_test, y_train, y_test = train_test_split(x,y, test_size=0.15, random_state=0)

In [20]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(1700, 20)
(300, 20)
(1700,)
(300,)


## 4. Split the data into Train/Test CSV File. 

In [21]:
# Work on copies so that adding the label column does not touch X_train / X_test
trainX = X_train.copy()
trainX[label] = y_train

testX = X_test.copy()
testX[label] = y_test

In [22]:
print(trainX.shape)
print(testX.shape)

(1700, 21)
(300, 21)


## 5. Upload data into the S3 Bucket.

In [23]:
trainX.to_csv("train-V-1.csv",index = False)
testX.to_csv("test-V-1.csv", index = False)

In [24]:
#help(sess.upload_data)

In [25]:
# Send the data to S3; SageMaker reads the training data from there
sk_prefix = "sagemaker/mobile_price_classification/sklearncontainer"
trainpath = sess.upload_data(
    path="train-V-1.csv", bucket=bucket, key_prefix=sk_prefix
)

testpath = sess.upload_data(
    path="test-V-1.csv", bucket=bucket, key_prefix=sk_prefix
)
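
`sess.upload_data` returns the S3 URI of the uploaded object, and those URIs are what `fit` receives later. For a single file, the URI is the bucket, the key prefix, and the file name joined together. The helper `expected_s3_uri` below is hypothetical and only mirrors that naming; it does not contact S3.

```python
import os
import posixpath


def expected_s3_uri(bucket, key_prefix, local_path):
    """Return the URI a single-file upload is expected to end up at."""
    filename = os.path.basename(local_path)
    # S3 keys always use forward slashes, hence posixpath.
    return "s3://" + posixpath.join(bucket, key_prefix, filename)


print(expected_s3_uri(
    "sagemaker-tutorials-mlhub-039229394722",
    "sagemaker/mobile_price_classification/sklearncontainer",
    "train-V-1.csv",
))
```

Comparing this string with `trainpath` is a quick way to confirm the upload went where it was meant to.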

## 6. Create Training Script

In [27]:
%%writefile script.py


from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import sklearn
import joblib
from io import StringIO  # only needed if the commented-out input_fn is enabled
import argparse
import os
import pandas as pd

# inference functions ---------------

# def input_fn(request_body, request_content_type):
#     print(request_body)
#     print(request_content_type)
#     if request_content_type == "text/csv":
#         request_body = request_body.strip()
#         try:
#             df = pd.read_csv(StringIO(request_body), header=None)
#             return df
        
#         except Exception as e:
#             print(e)
#     else:
#         return """Please use Content-Type = 'text/csv' and, send the request!!""" 
 
    
def model_fn(model_dir):
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf

# def predict_fn(input_data, model):
#     if type(input_data) != str:
#         prediction = model.predict(input_data)
#         print(prediction)
#         return prediction
#     else:
#         return input_data
        
    
if __name__ == "__main__":

    print("[INFO] Extracting arguments")
    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--random_state", type=int, default=0)

    # Data, model, and output directories
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
    parser.add_argument("--train-file", type=str, default="train-V-1.csv")
    parser.add_argument("--test-file", type=str, default="test-V-1.csv")

    args, _ = parser.parse_known_args()
    
    print("SKLearn Version: ", sklearn.__version__)
    print("Joblib Version: ", joblib.__version__)

    print("[INFO] Reading data")
    print()
    train_df = pd.read_csv(os.path.join(args.train, args.train_file))
    test_df = pd.read_csv(os.path.join(args.test, args.test_file))
    
    features = list(train_df.columns)
    label = features.pop(-1)
    
    print("Building training and testing datasets")
    print()
    X_train = train_df[features]
    X_test = test_df[features]
    y_train = train_df[label]
    y_test = test_df[label]

    print('Column order: ')
    print(features)
    print()
    
    print("Label column is: ",label)
    print()
    
    print("Data Shape: ")
    print()
    print("---- SHAPE OF TRAINING DATA (85%) ----")
    print(X_train.shape)
    print(y_train.shape)
    print()
    print("---- SHAPE OF TESTING DATA (15%) ----")
    print(X_test.shape)
    print(y_test.shape)
    print()
    
  
    print("Training RandomForest Model.....")
    print()
    model = RandomForestClassifier(n_estimators=args.n_estimators, random_state=args.random_state, verbose=3, n_jobs=-1)
    model.fit(X_train, y_train)
    print()
    

    model_path = os.path.join(args.model_dir, "model.joblib")
    joblib.dump(model,model_path)
    print("Model persisted at " + model_path)
    print()

    
    y_pred_test = model.predict(X_test)
    test_acc = accuracy_score(y_test,y_pred_test)
    test_rep = classification_report(y_test,y_pred_test)

    print()
    print("---- METRICS RESULTS FOR TESTING DATA ----")
    print()
    print("Total Rows are: ", X_test.shape[0])
    print('[TESTING] Model Accuracy is: ', test_acc)
    print('[TESTING] Testing Report: ')
    print(test_rep)

Overwriting script.py
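
Inside the training container, SageMaker passes each hyperparameter to the script as a `--name value` command-line argument and exposes the data and model locations through environment variables such as `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN`. The sketch below mimics that with a hypothetical `parse_args(argv, env)` helper, so the behavior can be tried without a container. The `/opt/ml/...` paths are the conventional container locations and serve only as sample values here.

```python
import argparse


def parse_args(argv, env):
    """Parse hyperparameters from argv, with directory defaults taken from env."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--random_state", type=int, default=0)
    parser.add_argument("--model-dir", type=str, default=env.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=env.get("SM_CHANNEL_TRAIN"))
    # parse_known_args returns unrecognized arguments instead of exiting.
    args, _ = parser.parse_known_args(argv)
    return args


# Sample values standing in for the container environment (assumption).
container_env = {
    "SM_MODEL_DIR": "/opt/ml/model",
    "SM_CHANNEL_TRAIN": "/opt/ml/input/data/train",
}
args = parse_args(["--n_estimators", "50", "--unknown-flag", "1"], container_env)
print(args.n_estimators, args.random_state, args.model_dir, args.train)
```

This is also why the local run in the next cell works: passing `--model-dir ./ --train ./ --test ./` overrides defaults that would otherwise be `None` outside a container.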


In [28]:
! python script.py --n_estimators 100 \
                   --random_state 0 \
                   --model-dir ./ \
                   --train ./ \
                   --test ./

[INFO] Extracting arguments
SKLearn Version:  1.3.2
Joblib Version:  1.3.2
[INFO] Reading data

Building training and testing datasets

Column order: 
['battery_power', 'blue', 'clock_speed', 'dual_sim', 'fc', 'four_g', 'int_memory', 'm_dep', 'mobile_wt', 'n_cores', 'pc', 'px_height', 'px_width', 'ram', 'sc_h', 'sc_w', 'talk_time', 'three_g', 'touch_screen', 'wifi']

Label column is:  price_range

Data Shape: 

---- SHAPE OF TRAINING DATA (85%) ----
(1700, 20)
(1700,)

---- SHAPE OF TESTING DATA (15%) ----
(300, 20)
(300,)

Training RandomForest Model.....

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 2 concurrent workers.
building tree 1 of 100
building tree 2 of 100
building tree 3 of 100
building tree 4 of 100
building tree 5 of 100
building tree 6 of 100
building tree 7 of 100
building tree 8 of 100
building tree 9 of 100
building tree 10 of 100
building tree 11 of 100
building tree 12 of 100
building tree 

In [29]:
%ls

[0m[01;34mlost+found[0m/   script.py            test-V-1.csv
model.joblib  Taller ML Ops.ipynb  train-V-1.csv


## 7. Run the training script inside a SageMaker container.

In [30]:
from sagemaker.sklearn.estimator import SKLearn

FRAMEWORK_VERSION = "0.23-1"

sklearn_estimator = SKLearn(
    entry_point="script.py",
    role=get_execution_role(),
    instance_count=1,
    instance_type="ml.m5.large",
    framework_version=FRAMEWORK_VERSION,
    base_job_name="RF-custom-sklearn",
    hyperparameters={
        "n_estimators": 100,
        "random_state": 0,
    },
    use_spot_instances=True,
    max_wait=7200,
    max_run=3600
)
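
With `use_spot_instances=True`, `max_run` caps the training time while `max_wait` caps training time plus the time spent waiting for Spot capacity, so `max_wait` has to be at least `max_run`. A small pre-flight check could look like the sketch below; `check_spot_limits` is a hypothetical helper, not part of the SageMaker SDK.

```python
def check_spot_limits(max_run, max_wait):
    """Validate Spot time limits (seconds) and return the guaranteed waiting budget."""
    if max_wait < max_run:
        raise ValueError(
            f"max_wait ({max_wait}) must be greater than or equal to max_run ({max_run})"
        )
    return max_wait - max_run


slack = check_spot_limits(max_run=3600, max_wait=7200)
print(f"At least {slack} s remain for waiting on Spot capacity")
```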

In [31]:
# Launch the training job and block until it finishes (wait=True makes the call synchronous)
sklearn_estimator.fit({"train": trainpath, "test": testpath}, wait=True)
# sklearn_estimator.fit({"train": datapath}, wait=True)

INFO:sagemaker:Creating training-job with name: RF-custom-sklearn-2024-02-25-13-24-21-445


2024-02-25 13:24:22 Starting - Starting the training job...
2024-02-25 13:24:36 Starting - Preparing the instances for training...
2024-02-25 13:25:16 Downloading - Downloading input data...
2024-02-25 13:25:42 Downloading - Downloading the training image...
2024-02-25 13:26:17 Training - Training image download completed. Training in progress..
2024-02-25 13:26:22,557 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2024-02-25 13:26:22,560 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2024-02-25 13:26:22,597 sagemaker_sklearn_container.training INFO     Invoking user training script.
2024-02-25 13:26:22,759 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2024-02-25 13:26:22,771 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2024-02-25 13:26:22,782 sagemaker-training-toolkit INFO     No GPUs det

## 8. Store Model Artifacts (model.tar.gz) in the S3 Bucket.

In [35]:
sklearn_estimator.latest_training_job.wait(logs="None")
artifact = sm_boto3.describe_training_job(
    TrainingJobName=sklearn_estimator.latest_training_job.name
)["ModelArtifacts"]["S3ModelArtifacts"]

print("Model artifact persisted at " + artifact)


2024-02-25 13:26:43 Starting - Preparing the instances for training
2024-02-25 13:26:43 Downloading - Downloading the training image
2024-02-25 13:26:43 Training - Training image download completed. Training in progress.
2024-02-25 13:26:43 Uploading - Uploading generated training model
2024-02-25 13:26:43 Completed - Training job completed
Model artifact persisted at s3://sagemaker-us-east-1-039229394722/RF-custom-sklearn-2024-02-25-13-24-21-445/output/model.tar.gz
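
The training job packs everything written to the model directory into `model.tar.gz`. To inspect the artifact locally, one option is to split the S3 URI into bucket and key, download the file, and list the archive members. `split_s3_uri`, `list_archive_members`, and `download_artifact` are hypothetical helper names, and the download step assumes Boto3 credentials with read access to the bucket.

```python
import tarfile
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_archive_members(archive_path):
    """Return the file names stored in a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


def download_artifact(uri, local_path="model.tar.gz"):
    """Download the artifact from S3 (needs AWS credentials)."""
    import boto3  # imported lazily; the other helpers work without it

    bucket, key = split_s3_uri(uri)
    boto3.client("s3").download_file(bucket, key, local_path)
    return local_path
```

For this job, `list_archive_members(download_artifact(artifact))` would be expected to include `model.joblib`, the file that `script.py` writes with `joblib.dump`.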


## 9. Deploy a SageMaker Endpoint (API) for the trained model, and test it.

In [36]:
from sagemaker.sklearn.model import SKLearnModel
from time import gmtime, strftime

model_name = "MISD-sklearn-model-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
model = SKLearnModel(
    name =  model_name,
    model_data=artifact,
    role=get_execution_role(),
    entry_point="script.py",
    framework_version=FRAMEWORK_VERSION,
)

In [37]:
#endpoint_name = "Custom-sklearn-model-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
endpoint_name = "MISD-sklearn-model"
print("EndpointName={}".format(endpoint_name))

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m4.xlarge",
    endpoint_name=endpoint_name,
)


EndpointName=MISD-sklearn-model


INFO:sagemaker:Creating model with name: MISD-sklearn-model-2024-02-25-13-29-35
INFO:sagemaker:Creating endpoint-config with name MISD-sklearn-model
INFO:sagemaker:Creating endpoint with name MISD-sklearn-model


--------

KeyboardInterrupt: 

In [123]:
testX[features][0:2].values.tolist()

[[1454.0,
  1.0,
  0.5,
  1.0,
  1.0,
  0.0,
  34.0,
  0.7,
  83.0,
  4.0,
  3.0,
  250.0,
  1033.0,
  3419.0,
  7.0,
  5.0,
  5.0,
  1.0,
  1.0,
  0.0],
 [1092.0,
  1.0,
  0.5,
  1.0,
  10.0,
  0.0,
  11.0,
  0.5,
  167.0,
  3.0,
  14.0,
  468.0,
  571.0,
  737.0,
  14.0,
  4.0,
  11.0,
  0.0,
  1.0,
  0.0]]

In [49]:
testX[features][0:2]

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,pc,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi
405,1454,1,0.5,1,1,0,34,0.7,83,4,3,250,1033,3419,7,5,5,1,1,0
1190,1092,1,0.5,1,10,0,11,0.5,167,3,14,468,571,737,14,4,11,0,1,0


In [45]:
print(predictor.predict(testX[features][0:2].values.tolist()))

[3 0]
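
The endpoint returns the encoded class, so `[3 0]` means class 3 for the first row and class 0 for the second. The Kaggle dataset description presents the four `price_range` codes as cost bands. The mapping below follows that description and should be treated as an assumption to verify against the data source; `PRICE_RANGE_LABELS` and `to_labels` are hypothetical names.

```python
# Assumed meaning of the price_range codes, per the dataset description.
PRICE_RANGE_LABELS = {
    0: "low cost",
    1: "medium cost",
    2: "high cost",
    3: "very high cost",
}


def to_labels(predictions):
    """Translate encoded predictions into readable cost bands."""
    return [PRICE_RANGE_LABELS[int(p)] for p in predictions]


print(to_labels([3, 0]))  # ['very high cost', 'low cost']
```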


## Don't forget to delete the endpoint!

In [103]:
sm_boto3.delete_endpoint(EndpointName=endpoint_name)

{'ResponseMetadata': {'RequestId': '4ef260d2-055a-4aa1-994d-0d05c815567b',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '4ef260d2-055a-4aa1-994d-0d05c815567b',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Sun, 25 Feb 2024 03:30:03 GMT'},
  'RetryAttempts': 0}}
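
`delete_endpoint` removes the running endpoint, which is what stops the instance charges, but the endpoint configuration and the model object created by `deploy` stay in the account. A sketch of a fuller cleanup follows. `cleanup_endpoint` is a hypothetical helper, and it assumes the endpoint configuration shares the endpoint's name, as the deploy log (`Creating endpoint-config with name MISD-sklearn-model`) suggests.

```python
def cleanup_endpoint(sm_client, endpoint_name, model_name, config_name=None):
    """Delete the endpoint, its configuration, and the model, in that order."""
    # Assumption: the config was created with the same name as the endpoint.
    config_name = config_name or endpoint_name
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    sm_client.delete_model(ModelName=model_name)


# Intended use with the names from this notebook:
# cleanup_endpoint(sm_boto3, endpoint_name, model_name)
```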

In [55]:
help(runtime.invoke_endpoint)

Help on method invoke_endpoint in module botocore.client:

invoke_endpoint(*args, **kwargs) method of botocore.client.SageMakerRuntime instance
    After you deploy a model into production using Amazon SageMaker hosting services, your client applications use this API to get inferences from the model hosted at the specified endpoint.
    
     
    
    For an overview of Amazon SageMaker, see `How It Works <https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works.html>`__.
    
     
    
    Amazon SageMaker strips all POST headers except those supported by the API. Amazon SageMaker might add additional headers. You should not rely on the behavior of headers outside those enumerated in the request syntax.
    
     
    
    Calls to ``InvokeEndpoint`` are authenticated by using Amazon Web Services Signature Version 4. For information, see `Authenticating Requests (Amazon Web Services Signature Version 4) <https://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-authenticating-reque

In [72]:
testX[features][0:1].to_dict('index')[405]

{'battery_power': 1454,
 'blue': 1,
 'clock_speed': 0.5,
 'dual_sim': 1,
 'fc': 1,
 'four_g': 0,
 'int_memory': 34,
 'm_dep': 0.7,
 'mobile_wt': 83,
 'n_cores': 4,
 'pc': 3,
 'px_height': 250,
 'px_width': 1033,
 'ram': 3419,
 'sc_h': 7,
 'sc_w': 5,
 'talk_time': 5,
 'three_g': 1,
 'touch_screen': 1,
 'wifi': 0}

In [132]:
import json

dict_1 = {'battery_power': 1454,
 'blue': 1,
 'clock_speed': 0.5,
 'dual_sim': 1,
 'fc': 1,
 'four_g': 0,
 'int_memory': 34,
 'm_dep': 0.7,
 'mobile_wt': 83,
 'n_cores': 4,
 'pc': 3,
 'px_height': 250,
 'px_width': 1033,
 'ram': 3419,
 'sc_h': 7,
 'sc_w': 5,
 'talk_time': 5,
 'three_g': 1,
 'touch_screen': 1,
 'wifi': 0}

# Convert the dict to a JSON string
json_string = json.dumps(dict_1)

# Encode the JSON string to bytes
data_bytes = json_string.encode('utf-8')


In [134]:
import json
import boto3

# Name of the deployed endpoint; replace with your own
ENDPOINT_NAME = "smodel"
runtime = boto3.client('sagemaker-runtime')

# One row of feature values, in the same column order used for training
payload = testX[features][0:1].values.tolist()
print(payload)

# The body has to match ContentType. Sending the JSON-encoded dict (data_bytes)
# as 'text/csv' made the endpoint fail with a 500 ModelError, most likely
# because the container could not parse it as CSV. Send comma-separated values instead.
csv_body = "\n".join(",".join(str(v) for v in row) for row in payload)

response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
                                   ContentType='text/csv',
                                   Accept='application/json',
                                   Body=csv_body.encode('utf-8'))

# Assumes the container answers with a JSON list of predicted classes
result = json.loads(response['Body'].read().decode())
print(result)
pred = int(result[0])
print("Predicted price_range:", pred)

In [59]:
#data = json.loads(json.dumps(event))
import json

# JSON string
json_string = '{"name": "John", "age": 30, "city": "New York"}'

# Parse JSON string to Python dictionary
parsed_dict = json.loads(json_string)
print(type(parsed_dict))
# Accessing the parsed data
print(parsed_dict["name"])  # Output: John
print(parsed_dict["age"])   # Output: 30
print(parsed_dict["city"])  # Output: New York


<class 'dict'>
John
30
New York


In [52]:
lambda_handler({'nuevo'},"")

TypeError: Object of type set is not JSON serializable
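
The `TypeError` above comes from passing a Python `set` (`{'nuevo'}`) as the event. Lambda events are JSON documents, so the handler should be exercised with a `dict`. The notebook does not show `lambda_handler` itself, so the version below is only a hypothetical sketch. It assumes the event carries a `features` list in training column order, that the endpoint accepts `text/csv`, and that it returns a JSON list of classes when `Accept` is `application/json`. The optional `runtime` parameter is an addition for local testing; Lambda itself calls the handler with `(event, context)` only.

```python
import json
import os


def lambda_handler(event, context, runtime=None):
    """Forward one feature row to the SageMaker endpoint (hypothetical sketch)."""
    if runtime is None:
        import boto3  # imported lazily so the handler can be tested with a stub
        runtime = boto3.client("sagemaker-runtime")

    # Round-trip through JSON; this raises TypeError for sets and other
    # values that are not JSON serializable.
    data = json.loads(json.dumps(event))

    body = ",".join(str(v) for v in data["features"])
    response = runtime.invoke_endpoint(
        # ENDPOINT_NAME is an assumed environment variable; "smodel" is the fallback.
        EndpointName=os.environ.get("ENDPOINT_NAME", "smodel"),
        ContentType="text/csv",
        Accept="application/json",
        Body=body.encode("utf-8"),
    )
    result = json.loads(response["Body"].read().decode())
    return {"price_range": int(result[0])}
```

A local call would then pass a dict, for example `lambda_handler({"features": testX[features].iloc[0].tolist()}, None)`.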