# Mobile Price Classification using a SKLearn Custom Script in SageMaker

## Setting up the SageMaker environment

This Python script uses the SageMaker Python SDK, the library Amazon Web Services (AWS) provides for working with SageMaker, its machine learning platform.

1. `import sagemaker`: Imports the SageMaker Python library, which provides classes and functions to interact with SageMaker services.

2. `from sklearn.model_selection import train_test_split`: Imports the `train_test_split` function from the scikit-learn library, which is commonly used for splitting datasets into training and testing subsets.

3. `import boto3`: Imports the Boto3 library, which is the AWS SDK for Python. Boto3 allows Python developers to write software that uses services like Amazon S3 and Amazon EC2.

4. `import pandas as pd`: Imports the Pandas library, which provides data manipulation and analysis tools.

5. `sm_boto3 = boto3.client("sagemaker")`: Creates a Boto3 client for SageMaker service, which allows interaction with SageMaker APIs using Boto3.

6. `sess = sagemaker.Session()`: Creates a SageMaker session object. This session is used to perform various tasks related to SageMaker, such as creating training jobs, deploying models, and interacting with endpoints.

7. `region = sess.boto_session.region_name`: Retrieves the AWS region name from the session object.

8. `bucket = 'e2e-sagemaker-proj1'`: Assigns the name of an existing S3 bucket to the variable `bucket`. The train and test CSV files are uploaded to this bucket later in the notebook.

9. `print("Using bucket " + bucket)`: Prints a message indicating which S3 bucket is being used.

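Overall, this code sets up the environment for working with SageMaker: it creates the session, reads the AWS region, and names the S3 bucket that will hold the training and test data. The model artifact produced later lands in the session's default bucket instead, as the training output shows, because the estimator does not set an output location.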

In [1]:
import sagemaker
from sklearn.model_selection import train_test_split
import boto3
import pandas as pd

sm_boto3 = boto3.client("sagemaker")
sess = sagemaker.Session()
region = sess.boto_session.region_name
bucket = 'e2e-sagemaker-proj1' # Created S3 bucket name here
print("Using bucket " + bucket)

sagemaker.config INFO - Not applying SDK defaults from location: C:\ProgramData\sagemaker\sagemaker\config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: C:\Users\prati\AppData\Local\sagemaker\sagemaker\config.yaml
Using bucket e2e-sagemaker-proj1


## Read the data

We have a dataset of 2000 mobile phones, each belonging to one of four price ranges: 0, 1, 2, or 3. We perform multiclass classification to predict the price range of a phone from its features. The classes are balanced: each price range accounts for 25% of the rows.
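
As a quick illustration of what "balanced" means here, the sketch below computes per-class proportions with the standard library only. The `class_proportions` helper and the `toy_labels` list are hypothetical stand-ins, not the real dataset.

```python
from collections import Counter


def class_proportions(labels):
    """Return {class: share of rows} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in sorted(counts.items())}


# Hypothetical labels: 2000 rows spread evenly over price ranges 0-3
toy_labels = [i % 4 for i in range(2000)]
proportions = class_proportions(toy_labels)
print(proportions)
```

With a balanced target like this one, plain accuracy is a reasonable first metric because no single class can dominate it.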

In [2]:
df = pd.read_csv("mob_price_classification_train.csv")
df.head()

,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,...,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
0,842,0,2.2,0,1,0,7,0.6,188,2,...,20,756,2549,9,7,19,0,0,1,1
1,1021,1,0.5,1,0,1,53,0.7,136,3,...,905,1988,2631,17,3,7,1,1,0,2
2,563,1,0.5,1,2,1,41,0.9,145,5,...,1263,1716,2603,11,2,9,1,1,0,2
3,615,1,2.5,0,0,0,10,0.8,131,6,...,1216,1786,2769,16,8,11,1,0,0,2
4,1821,1,1.2,0,13,1,44,0.6,141,2,...,1208,1212,1411,8,2,15,1,1,0,1


## Data Exploration

In [3]:
# Shape of the data
df.shape

(2000, 21)

In [4]:
# Relative frequency of each class in the 'price_range' column
df['price_range'].value_counts(normalize=True)

price_range
1    0.25
2    0.25
3    0.25
0    0.25
Name: proportion, dtype: float64

In [5]:
# Print the columns of the dataframe df
df.columns

Index(['battery_power', 'blue', 'clock_speed', 'dual_sim', 'fc', 'four_g',
       'int_memory', 'm_dep', 'mobile_wt', 'n_cores', 'pc', 'px_height',
       'px_width', 'ram', 'sc_h', 'sc_w', 'talk_time', 'three_g',
       'touch_screen', 'wifi', 'price_range'],
      dtype='object')

In [6]:
# Percentage of missing values per column. The output shows there are none
df.isnull().mean() * 100

battery_power    0.0
blue             0.0
clock_speed      0.0
dual_sim         0.0
fc               0.0
four_g           0.0
int_memory       0.0
m_dep            0.0
mobile_wt        0.0
n_cores          0.0
pc               0.0
px_height        0.0
px_width         0.0
ram              0.0
sc_h             0.0
sc_w             0.0
talk_time        0.0
three_g          0.0
touch_screen     0.0
wifi             0.0
price_range      0.0
dtype: float64

## Data Pre-Processing

In [7]:
# Create a list of all features
features = list(df.columns)
features

['battery_power',
 'blue',
 'clock_speed',
 'dual_sim',
 'fc',
 'four_g',
 'int_memory',
 'm_dep',
 'mobile_wt',
 'n_cores',
 'pc',
 'px_height',
 'px_width',
 'ram',
 'sc_h',
 'sc_w',
 'talk_time',
 'three_g',
 'touch_screen',
 'wifi',
 'price_range']

In [8]:
# Identify the label column
label = features.pop(-1)
label

'price_range'

In [9]:
# Store the features in a DataFrame x and the labels in a Series y
x = df[features]
y = df[label]

In [13]:
print("Shape of features dataframes",x.shape)
# Print the features of first five records
x.head()

Shape of features dataframes (2000, 20)


,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,pc,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi
0,842,0,2.2,0,1,0,7,0.6,188,2,2,20,756,2549,9,7,19,0,0,1
1,1021,1,0.5,1,0,1,53,0.7,136,3,6,905,1988,2631,17,3,7,1,1,0
2,563,1,0.5,1,2,1,41,0.9,145,5,6,1263,1716,2603,11,2,9,1,1,0
3,615,1,2.5,0,0,0,10,0.8,131,6,9,1216,1786,2769,16,8,11,1,0,0
4,1821,1,1.2,0,13,1,44,0.6,141,2,14,1208,1212,1411,8,2,15,1,1,0


In [19]:
# Print the labels of first five records
y.head()

0    1
1    2
2    2
3    2
4    1
Name: price_range, dtype: int64

## Perform Train Test Split

In [14]:
X_train, X_test, y_train, y_test = train_test_split(x,y, test_size=0.15, random_state=0)

# print shapes of train and test dataframes
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(1700, 20)
(300, 20)
(1700,)
(300,)
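
The shapes above follow from `test_size=0.15` applied to 2000 rows. The sketch below reproduces the arithmetic, assuming the test share is rounded up and the remainder goes to training; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    # Assumption: the test share is rounded up, training gets the rest
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(2000, 0.15)
print(n_train, n_test)
```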


In [16]:
# Create Train and Test dataframes to be stored for further use.
# .copy() keeps the label column from being added to X_train / X_test as a side effect
trainX = X_train.copy()
trainX[label] = y_train

testX = X_test.copy()
testX[label] = y_test

# Print shape of new dataframes
print("Shape of Train dataframe",trainX.shape)
print("Shape of Test dataframe",testX.shape)

Shape of Train dataframe (1700, 21)
Shape of Test dataframe (300, 21)


In [17]:
# Save the train and test dataframes 
trainX.to_csv("train-V-1.csv",index = False)
testX.to_csv("test-V-1.csv", index = False)

## **Data Ingestion:** Send the Train and Test CSV files to S3 bucket 
SageMaker reads training data from S3, so the train and test CSV files must be uploaded to the bucket before a training job can run. The code in this section does that. Let's break it down:

1. `# Send data to S3. SageMaker will take training data from s3`: This is a comment explaining that the purpose of the following code is to upload data to an S3 bucket for SageMaker to use during training.

2. `sk_prefix = "sagemaker/mobile_price_classification/sklearncontainer"`: This line defines a prefix to be used for the keys (paths) of the files uploaded to S3. It's a naming convention that helps organize files within the S3 bucket. Here, it indicates that the data files are related to a SageMaker project for mobile price classification using the Scikit-learn container.

3. `trainpath = sess.upload_data(path="train-V-1.csv", bucket=bucket, key_prefix=sk_prefix)`: This line uploads the file named "train-V-1.csv" to the specified S3 bucket (`bucket`) with the key (path) constructed using the `key_prefix`. The `upload_data` method is provided by the SageMaker session (`sess`), and it automatically uploads the file to S3 and returns the S3 path where the file is uploaded. This path is stored in the variable `trainpath`.

4. `testpath = sess.upload_data(path="test-V-1.csv", bucket=bucket, key_prefix=sk_prefix)`: Similar to the previous line, this line uploads the file named "test-V-1.csv" to the same S3 bucket with the same key prefix. The S3 path where the file is uploaded is stored in the variable `testpath`.

5. `print(trainpath)`: Prints out the S3 path where the training data file was uploaded.

6. `print(testpath)`: Prints out the S3 path where the testing data file was uploaded.

Overall, this code prepares the data by uploading it to an S3 bucket, making it accessible to SageMaker for training and testing machine learning models.
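
To make the relationship between `bucket`, `key_prefix`, and the returned path concrete, here is a minimal sketch that predicts the URI for a single-file upload without contacting AWS. `expected_s3_uri` is a hypothetical helper written for illustration; it is not part of the SageMaker SDK.

```python
import posixpath


def expected_s3_uri(bucket, key_prefix, local_path):
    """Hypothetical helper: predict the S3 URI for a single uploaded file."""
    # Only the file name is kept; the local directory is not part of the key
    filename = posixpath.basename(local_path.replace("\\", "/"))
    key = posixpath.join(key_prefix, filename)
    return "s3://{}/{}".format(bucket, key)


uri = expected_s3_uri(
    "e2e-sagemaker-proj1",
    "sagemaker/mobile_price_classification/sklearncontainer",
    "train-V-1.csv",
)
print(uri)
```

Comparing this string with the value printed by `print(trainpath)` is a cheap sanity check that the file landed where the training job will look for it.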

In [18]:
# Print the bucket name
bucket

'e2e-sagemaker-proj1'

In [20]:
# Send data to S3. SageMaker will take training data from s3
sk_prefix = "sagemaker/mobile_price_classification/sklearncontainer"
trainpath = sess.upload_data(
    path="train-V-1.csv", bucket=bucket, key_prefix=sk_prefix
)

testpath = sess.upload_data(
    path="test-V-1.csv", bucket=bucket, key_prefix=sk_prefix
)
print(trainpath)
print(testpath)

s3://e2e-sagemaker-proj1/sagemaker/mobile_price_classification/sklearncontainer/train-V-1.csv
s3://e2e-sagemaker-proj1/sagemaker/mobile_price_classification/sklearncontainer/test-V-1.csv


## Creating script.py from SageMaker documentation
The script below defines the training flow. It trains a Random Forest classifier with the given hyperparameters on the train and test files that SageMaker copies from the S3 bucket, saves the trained model, and prints the test accuracy and classification report.

In [21]:
%%writefile script.py

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import sklearn
import joblib
import argparse
import os
import pandas as pd

# Called by the SageMaker inference container to load the saved model for serving
def model_fn(model_dir):
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf

# Training entry point: this block runs when SageMaker executes script.py as a training job
if __name__ == "__main__":

    print("[INFO] Extracting arguments")
    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script. Specific to Random Forest classifier
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--random_state", type=int, default=0)

    # Data, model, and output directories. SageMaker sets the SM_* environment variables inside the training container
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR")) # where to save the trained model
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN")) # local copy of the "train" channel
    parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST")) # local copy of the "test" channel
    parser.add_argument("--train-file", type=str, default="train-V-1.csv")
    parser.add_argument("--test-file", type=str, default="test-V-1.csv")

    args, _ = parser.parse_known_args()
    
    print("SKLearn Version: ", sklearn.__version__)
    print("Joblib Version: ", joblib.__version__)

    print("[INFO] Reading data")
    print()
    train_df = pd.read_csv(os.path.join(args.train, args.train_file))
    test_df = pd.read_csv(os.path.join(args.test, args.test_file))
    
    features = list(train_df.columns)
    label = features.pop(-1)
    
    print("Building training and testing datasets")
    print()
    X_train = train_df[features]
    X_test = test_df[features]
    y_train = train_df[label]
    y_test = test_df[label]

    print('Column order: ')
    print(features)
    print()
    
    print("Label column is: ",label)
    print()
    
    print("Data Shape: ")
    print()
    print("---- SHAPE OF TRAINING DATA (85%) ----")
    print(X_train.shape)
    print(y_train.shape)
    print()
    print("---- SHAPE OF TESTING DATA (15%) ----")
    print(X_test.shape)
    print(y_test.shape)
    print()
    
  
    print("Training RandomForest Model.....")
    print()
    model = RandomForestClassifier(n_estimators=args.n_estimators, random_state=args.random_state, verbose=3, n_jobs=-1)
    model.fit(X_train, y_train)
    print()
    

    model_path = os.path.join(args.model_dir, "model.joblib")
    joblib.dump(model,model_path)
    print("Model persisted at " + model_path)
    print()

    
    y_pred_test = model.predict(X_test)
    test_acc = accuracy_score(y_test,y_pred_test)
    test_rep = classification_report(y_test,y_pred_test)

    print()
    print("---- METRICS RESULTS FOR TESTING DATA ----")
    print()
    print("Total Rows are: ", X_test.shape[0])
    print('[TESTING] Model Accuracy is: ', test_acc)
    print('[TESTING] Testing Report: ')
    print(test_rep)

Writing script.py
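
The argument handling in `script.py` can be tried locally without launching a training job. The sketch below mirrors the script's parser but takes the environment as an explicit dictionary; `parse_training_args` and the `fake_env` values are assumptions made for illustration and only stand in for what the training container provides.

```python
import argparse


def parse_training_args(argv, env):
    """Parse arguments the way script.py does, with the environment passed in."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--random_state", type=int, default=0)
    parser.add_argument("--model-dir", type=str, default=env.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=env.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test", type=str, default=env.get("SM_CHANNEL_TEST"))
    # parse_known_args ignores flags the script does not declare
    args, _ = parser.parse_known_args(argv)
    return args


# Hypothetical values standing in for the container's environment
fake_env = {
    "SM_MODEL_DIR": "/opt/ml/model",
    "SM_CHANNEL_TRAIN": "/opt/ml/input/data/train",
    "SM_CHANNEL_TEST": "/opt/ml/input/data/test",
}
args = parse_training_args(["--n_estimators", "50", "--unknown-flag", "x"], fake_env)
print(args.n_estimators, args.random_state, args.model_dir, args.train)
```

Hyperparameters given on the command line override the defaults, while the directories fall back to the environment when no flag is passed. That is why the estimator only needs to supply `n_estimators` and `random_state`.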


## SageMaker utilizing script.py

In [35]:
# Import the SKLearn estimator from the SageMaker Python SDK
from sagemaker.sklearn.estimator import SKLearn

FRAMEWORK_VERSION = "0.23-1"

sklearn_estimator = SKLearn(
    # created above
    entry_point="script.py",

    # ARN of an IAM role that SageMaker can assume (the ARN of an IAM user does not work)
    role="arn:aws:iam::725942761963:role/e2e-mobrole-sagemaker",

    # number and type of instances SageMaker provisions for the training job
    instance_count=1,
    instance_type="ml.m5.large",

    # framework version present in the documentation, declared above
    framework_version=FRAMEWORK_VERSION,

    # prefix for the training job name (SageMaker appends a timestamp)
    base_job_name="RF-custom-sklearn",

    # hyperparameters to the RF classifier
    hyperparameters={
        "n_estimators": 100,
        "random_state": 0,
    },
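    # Managed spot training to reduce cost. Times are in seconds;
    # max_wait covers waiting for spot capacity plus the run itself, so it must be at least max_run
    use_spot_instances=True,
    max_wait=7200,
    max_run=3600,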
)
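
The spot settings above interact: `max_wait` bounds the total time, including waiting for spot capacity, while `max_run` bounds the training itself. A minimal pre-flight check, assuming the constraint that `max_wait` must be at least `max_run` when spot instances are used, could look like this. `check_spot_settings` is a hypothetical helper, not an SDK function.

```python
def check_spot_settings(use_spot_instances, max_run, max_wait):
    """Return a list of problems with the given settings (empty if none)."""
    problems = []
    # Assumption: the constraint only applies to managed spot training
    if use_spot_instances and max_wait < max_run:
        problems.append(
            "max_wait ({}s) must be at least max_run ({}s)".format(max_wait, max_run)
        )
    return problems


# Values used by the estimator in this notebook
print(check_spot_settings(True, max_run=3600, max_wait=7200))
```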

## Train the model on SageMaker

In [36]:
# Launch the training job: SageMaker provisions an instance and starts training. With wait=True the call blocks and streams the logs until the job finishes
sklearn_estimator.fit({"train": trainpath, "test": testpath}, wait=True)

# sklearn_estimator.fit({"train": datapath}, wait=True)

INFO:sagemaker:Creating training-job with name: RF-custom-sklearn-2024-01-31-03-29-56-104


2024-01-31 03:29:57 Starting - Starting the training job...
2024-01-31 03:30:11 Starting - Preparing the instances for training......
2024-01-31 03:31:27 Downloading - Downloading input data...
2024-01-31 03:31:57 Downloading - Downloading the training image...
2024-01-31 03:32:33 Training - Training image download completed. Training in progress..2024-01-31 03:32:36,364 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2024-01-31 03:32:36,367 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2024-01-31 03:32:36,408 sagemaker_sklearn_container.training INFO     Invoking user training script.
2024-01-31 03:32:36,557 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2024-01-31 03:32:36,569 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2024-01-31 03:32:36,582 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2024-01-31 

In [37]:
# Wait for the job to finish (without streaming logs), then look up where the model artifact was stored
sklearn_estimator.latest_training_job.wait(logs="None")
artifact = sm_boto3.describe_training_job(
    TrainingJobName=sklearn_estimator.latest_training_job.name
)["ModelArtifacts"]["S3ModelArtifacts"]

# Prints the exact location of the model in the S3 bucket
print("Model artifact persisted at artifact:" + artifact)


2024-01-31 03:32:54 Starting - Preparing the instances for training
2024-01-31 03:32:54 Downloading - Downloading the training image
2024-01-31 03:32:54 Training - Training image download completed. Training in progress.
2024-01-31 03:32:54 Uploading - Uploading generated training model
2024-01-31 03:32:54 Completed - Training job completed
Model artifact persisted at artifact:s3://sagemaker-us-east-1-725942761963/RF-custom-sklearn-2024-01-31-03-29-56-104/output/model.tar.gz
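
The artifact is a gzip-compressed tar archive of whatever the script wrote to the model directory, which here is the single file `model.joblib`. The sketch below builds a toy archive in memory and lists its members to show the expected layout. The payload is placeholder bytes rather than a real model, and both helper names are hypothetical.

```python
import io
import tarfile


def make_toy_artifact():
    """Build an in-memory model.tar.gz holding a placeholder model.joblib."""
    payload = b"placeholder bytes, not a real model"
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo(name="model.joblib")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def list_artifact_members(fileobj):
    """Return the file names stored in a model.tar.gz file object."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        return archive.getnames()


members = list_artifact_members(make_toy_artifact())
print(members)
```

The same `list_artifact_members` call works on a real `model.tar.gz` opened in binary mode after downloading it from S3, which is a quick way to confirm that `model_fn` will find `model.joblib` at the top level.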


## Deploy the SageMaker model

In [38]:
# Wrap the trained model artifact in a SKLearnModel so that it can be deployed
from sagemaker.sklearn.model import SKLearnModel
from time import gmtime, strftime

# build a unique, timestamped name for the model
model_name = "Custom-sklearn-model-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
model = SKLearnModel(
    name =  model_name,
    model_data=artifact,
    role="arn:aws:iam::725942761963:role/e2e-mobrole-sagemaker",
    entry_point="script.py",
    framework_version=FRAMEWORK_VERSION,
)

In [32]:
# print the model name
model_name

'Custom-sklearn-model-2024-01-31-03-26-16'

In [39]:
# Deploy the model to a real-time endpoint; predictor.predict can then score new data. This takes a few minutes because SageMaker provisions an instance
endpoint_name = "Custom-sklearn-model-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("EndpointName={}".format(endpoint_name))

predictor = model.deploy(
    initial_instance_count=1,

    # instance type that hosts the endpoint
    instance_type="ml.m4.xlarge",
    endpoint_name=endpoint_name,
)

EndpointName=Custom-sklearn-model-2024-01-31-03-33-22


INFO:sagemaker:Creating model with name: Custom-sklearn-model-2024-01-31-03-33-22
INFO:sagemaker:Creating endpoint-config with name Custom-sklearn-model-2024-01-31-03-33-22
INFO:sagemaker:Creating endpoint with name Custom-sklearn-model-2024-01-31-03-33-22


------!

## Testing the Deployment

In [40]:
# print the predictor type
predictor

<sagemaker.sklearn.model.SKLearnPredictor at 0x23974c64fd0>

In [41]:
# Take the first two records of the test data as a sample
testX[features][0:2].values.tolist()

[[1454.0,
  1.0,
  0.5,
  1.0,
  1.0,
  0.0,
  34.0,
  0.7,
  83.0,
  4.0,
  3.0,
  250.0,
  1033.0,
  3419.0,
  7.0,
  5.0,
  5.0,
  1.0,
  1.0,
  0.0],
 [1092.0,
  1.0,
  0.5,
  1.0,
  10.0,
  0.0,
  11.0,
  0.5,
  167.0,
  3.0,
  14.0,
  468.0,
  571.0,
  737.0,
  14.0,
  4.0,
  11.0,
  0.0,
  1.0,
  0.0]]

In [42]:
# use the deployed model to predict the price range of the two examples
print(predictor.predict(testX[features][0:2].values.tolist()))

[3 0]


In [44]:
# delete endpoint to avoid charges
sm_boto3.delete_endpoint(EndpointName=endpoint_name) 

{'ResponseMetadata': {'RequestId': '8cc75e45-ab63-47cb-96fa-d69366e0764e',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '8cc75e45-ab63-47cb-96fa-d69366e0764e',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Wed, 31 Jan 2024 04:03:40 GMT'},
  'RetryAttempts': 0}}
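
Deleting the endpoint leaves behind the endpoint configuration and the model that were registered during deployment. A minimal cleanup sketch follows, assuming both carry the names shown in the deployment log. `delete_leftover_resources` is a hypothetical helper, and `RecordingClient` is a stand-in that only records calls so the example runs without AWS credentials; in the notebook you would pass `sm_boto3` instead.

```python
def delete_leftover_resources(sm_client, config_name, model_name):
    """Delete the endpoint configuration and model left after delete_endpoint."""
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    sm_client.delete_model(ModelName=model_name)


class RecordingClient:
    """Stand-in for boto3.client("sagemaker") that only records calls."""

    def __init__(self):
        self.calls = []

    def delete_endpoint_config(self, EndpointConfigName):
        self.calls.append(("delete_endpoint_config", EndpointConfigName))

    def delete_model(self, ModelName):
        self.calls.append(("delete_model", ModelName))


# Hypothetical names used only to exercise the helper
client = RecordingClient()
delete_leftover_resources(client, "example-endpoint-config", "example-model")
print(client.calls)
```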

## Testing Deployment using new data

In [None]:
import boto3

# Initialize the SageMaker runtime client, which is used to invoke endpoints
sagemaker_runtime = boto3.client("sagemaker-runtime")

# endpoint_name is reused from the deployment cell. That endpoint was deleted
# in the previous cell, so redeploy the model before running this cell.

# Data to score: a DataFrame with the same feature columns, in the same order,
# as the training data. The first two test records stand in for new data here.
data = testX[features][0:2]

# Serialize to CSV without header or index so that only numeric rows are sent
# (assumes the container's default CSV handling, which expects no header row)
data_csv = data.to_csv(header=False, index=False)

# Make predictions using the endpoint
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=data_csv,  # Input data in CSV format
    ContentType="text/csv",  # Content type for the request
    Accept="text/csv",  # Accept header for the response
)

# Parse the response
predictions = response["Body"].read().decode("utf-8")

# Handle the predictions as needed
print(predictions)
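
The request body and the response are both plain CSV text, so they can be built and parsed with the standard library alone. The sketch below is a minimal version under two assumptions: the endpoint expects headerless numeric rows, and the response separates predictions with commas or newlines. `rows_to_csv` and `parse_csv_predictions` are hypothetical helpers, and the sample strings are made up for illustration.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of numeric rows to headerless CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def parse_csv_predictions(body):
    """Parse class labels from a CSV response (comma or newline separated)."""
    tokens = body.replace("\n", ",").split(",")
    return [int(float(tok)) for tok in tokens if tok.strip()]


# Shortened, made-up rows; a real request needs all 20 feature values per row
payload = rows_to_csv([[1454.0, 1.0, 0.5], [1092.0, 1.0, 0.5]])
print(payload)

# Made-up response text used only to exercise the parser
print(parse_csv_predictions("3\n0\n"))
```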
