# Predicting the Quality of Red Wine 

## Environment Setup

- Image: Data Science
- Kernel: Python 3
- Instance type: ml.t3.medium

## Background

This notebook illustrates how to use Script Mode in SageMaker Studio by training a random forest regression model with scikit-learn. Once trained, the model predicts the quality of a wine from its features.

Input: *winequality-red.csv*, which contains 11 features for each wine plus a target column, 'quality'. The dataset comes from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/wine+quality).

We use the SKLearn Estimator, pointing to our train.py file as the entry point.  The train.py file contains custom training/inference code that SageMaker will run.
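
The contents of train.py are not shown in this notebook. As a rough sketch of what a Script Mode entry point typically looks like: SageMaker passes hyperparameters to the script as command-line arguments and exposes the data and model locations through environment variables such as `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR`. The `--estimators` argument mirrors the hyperparameter used later in this notebook; the default of 10, the `quality` target column, the `model.joblib` file name, and the function layout are assumptions made for illustration.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided paths."""
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line arguments, e.g. --estimators 20
    parser.add_argument('--estimators', type=int, default=10)
    # SageMaker sets these environment variables inside the training container;
    # the fallbacks are the container's conventional paths
    parser.add_argument('--model-dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
    parser.add_argument('--train', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAIN', '/opt/ml/input/data/train'))
    return parser.parse_args(argv)


def train(args):
    """Fit a random forest on train.csv and save it (assumed file layout)."""
    # Imported here so that argument parsing can be tried without the ML stack
    import joblib
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    df = pd.read_csv(os.path.join(args.train, 'train.csv'))
    X = df.drop(columns=['quality'])  # assumption: 'quality' is the target column
    y = df['quality']

    model = RandomForestRegressor(n_estimators=args.estimators)
    model.fit(X, y)

    # Files written to the model directory are packaged as the model artifact
    joblib.dump(model, os.path.join(args.model_dir, 'model.joblib'))


def model_fn(model_dir):
    """Load the saved model when the inference container starts."""
    import joblib
    return joblib.load(os.path.join(model_dir, 'model.joblib'))
```

In a real script, the file would end by calling `train(parse_args())` under an `if __name__ == '__main__':` guard so that training runs only when SageMaker executes the script directly.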

## Initialize Environment and Variables

In [2]:
# Import libraries
import boto3
import pandas as pd
import numpy as np
import time
import json
import os

import sagemaker
from sagemaker import get_execution_role
from sagemaker.serializers import CSVSerializer
from sagemaker.inputs import TrainingInput
from sagemaker.sklearn import SKLearn

# Get the SageMaker session and the execution role from the SageMaker domain
sess = sagemaker.Session()
role = get_execution_role()

bucket = 'test-sagemaker-script-mode-scikit-12052023' # Update with the name of a bucket that is already created in S3
prefix = 'demo' # The name of the folder that will be created in the S3 bucket

## Data

For this lesson, we'll take the local CSV file and split it 70/30 into training and validation sets.  Then we'll take these local files and upload them to S3 so SageMaker can use them.
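
The 70/30 positional split can be sketched with the split index derived from the row count. The example uses a plain list in place of the DataFrame, and `split_rows` and `train_fraction` are illustrative names that do not appear elsewhere in this notebook. Because Python slices are half-open, using the same index for both slices means no row is dropped or duplicated.

```python
def split_rows(rows, train_fraction=0.7):
    """Split a sequence positionally into (train, validation) parts."""
    split_idx = int(len(rows) * train_fraction)
    # rows[:split_idx] stops before split_idx; rows[split_idx:] starts at it
    return rows[:split_idx], rows[split_idx:]


rows = list(range(10))
train_rows, validation_rows = split_rows(rows)
print(len(train_rows), len(validation_rows))  # prints: 7 3
```

The same pattern applies to a DataFrame through `df.iloc[:split_idx]` and `df.iloc[split_idx:]`.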

In [3]:
# Read the data from the local CSV file and print the first five rows
df = pd.read_csv('winequality-red.csv')
df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [4]:
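# Split data 70/30 for training and validation.
# The split index is computed from the row count, and the same index is used
# for both slices so that no row is dropped between the two sets.
split_idx = int(len(df) * 0.7)
train = df.iloc[:split_idx, :]
validation = df.iloc[split_idx:, :]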

# Create CSVs for train and validation data
train.to_csv('train.csv', index=False)
validation.to_csv('validation.csv', index=False)

# Upload training and validation data to the S3 bucket
# S3 keys always use forward slashes, so build them directly rather than with os.path.join
s3_bucket = boto3.Session().resource('s3').Bucket(bucket)
s3_bucket.Object(f'{prefix}/train/train.csv').upload_file('train.csv')
s3_bucket.Object(f'{prefix}/validation/validation.csv').upload_file('validation.csv')

# The location of our training and validation data in S3
s3_input_train = TrainingInput(
    s3_data='s3://{}/{}/train/'.format(bucket, prefix), content_type='csv'
)
s3_input_validation = TrainingInput(
    s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv'
)

## Train

Now that we have our data in S3, we can move on to training.  In this section, we create the SKLearn estimator, with an entry point to our *train.py* script.  More information on the SKLearn estimator can be found [here](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html).

In [5]:
# Set up the SKLearn estimator, with entry point to our Python script
sk_estimator = SKLearn(entry_point='train.py', 
                       role=role,
                       instance_count=1, 
                       instance_type='ml.m5.large',
                       py_version='py3',
                       framework_version='0.23-1',
                       script_mode=True,
                       hyperparameters={
                              'estimators': 20
                            }
                       )

In [8]:
# "fit" executes the training job
sk_estimator.fit({'train': s3_input_train}) 

INFO:sagemaker:Creating training-job with name: sagemaker-scikit-learn-2023-05-12-11-20-10-045


2023-05-12 11:20:10 Starting - Starting the training job...
2023-05-12 11:20:27 Starting - Preparing the instances for training...
2023-05-12 11:21:15 Downloading - Downloading input data...
2023-05-12 11:21:46 Training - Downloading the training image...
2023-05-12 11:22:16 Training - Training image download completed. Training in progress..
2023-05-12 11:22:21,416 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2023-05-12 11:22:21,419 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-05-12 11:22:21,463 sagemaker_sklearn_container.training INFO     Invoking user training script.
2023-05-12 11:22:21,648 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-05-12 11:22:21,660 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-05-12 11:22:21,672 sagemaker-training-toolkit INFO     No GPUs detect

## Deploy

This section is optional for this lesson. Now that our model has been trained, we can create an endpoint and deploy the model to it. Once it's deployed, we can pass in sample data to get a prediction of wine quality.

Be sure to update the *endpoint_name* two cells below here.

In [9]:
# Create an endpoint
sk_endpoint_name = 'sklearn-rf-model'+time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

# Print the name of the endpoint so it can be used in the cell below
print('Endpoint name: ' + sk_endpoint_name)

# Deploy the model to the endpoint (this will take some time to complete)
sk_predictor = sk_estimator.deploy(initial_instance_count=1,instance_type='ml.m5.large',
                                   endpoint_name=sk_endpoint_name)

INFO:sagemaker:Creating model with name: sagemaker-scikit-learn-2023-05-12-11-26-02-562


Endpoint name: sklearn-rf-model2023-05-12-11-26-02


INFO:sagemaker:Creating endpoint-config with name sklearn-rf-model2023-05-12-11-26-02
INFO:sagemaker:Creating endpoint with name sklearn-rf-model2023-05-12-11-26-02


----!

In [12]:
# Pass sample data to get a prediction of wine quality
client = boto3.client('sagemaker-runtime')
content_type = 'application/json'

endpoint_name = 'sklearn-rf-model2023-05-12-11-26-02' # Update with the name of your endpoint that was printed in the cell above

# These are the values for a random wine record.  This particular wine should have a quality score of 6.
request_body = {'Input': [[5.3, 0.47, 0.11, 2.2, 0.048, 16, 89, 0.99182, 3.54, 0.88, 13.56666667]]}

# Serialize the request body to a JSON string
payload = json.dumps(request_body)

# Invoke the endpoint, passing in the sample wine data
response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType=content_type,
    Body=payload)
result = json.loads(response['Body'].read().decode())['Output']

# Output the result, which is the wine quality score
result

5
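
The request above sends a JSON object with an `Input` key and reads an `Output` key from the response. That contract is defined by the inference functions in train.py, which this notebook does not show. Below is a minimal sketch of handlers that would satisfy it, assuming the SageMaker scikit-learn container's `input_fn`, `predict_fn`, and `output_fn` hooks. Rounding to a whole-number score and the `ConstantModel` stand-in are assumptions made so the example can run locally without an endpoint.

```python
import json


def input_fn(request_body, request_content_type):
    """Turn the JSON request into a list of feature rows."""
    if request_content_type != 'application/json':
        raise ValueError('Unsupported content type: ' + request_content_type)
    return json.loads(request_body)['Input']


def predict_fn(input_data, model):
    """Run the model and round to a whole-number quality score (assumption)."""
    prediction = model.predict(input_data)
    return int(round(float(prediction[0])))


def output_fn(prediction, accept):
    """Wrap the prediction in the JSON shape the client expects."""
    return json.dumps({'Output': prediction})


class ConstantModel:
    """Hypothetical stand-in for the trained regressor, for a local dry run."""

    def predict(self, rows):
        return [5.4 for _ in rows]


body = json.dumps({'Input': [[5.3, 0.47, 0.11, 2.2, 0.048, 16, 89, 0.99182, 3.54, 0.88, 13.56666667]]})
rows = input_fn(body, 'application/json')
response = output_fn(predict_fn(rows, ConstantModel()), 'application/json')
print(json.loads(response)['Output'])  # prints: 5
```

Keeping the three steps separate makes it possible to test the request parsing and response formatting locally before deploying an endpoint.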