# WINE QUALITY PREDICTION APP USING AWS SAGEMAKER'S IN-BUILT XGBOOST - End-to-End
We will build a Wine Quality Prediction App to help determine the quality of wine from its composition:
- I)   PROBLEM STATEMENT & DATA COLLECTION:

You want to automatically determine the quality of a wine from its underlying components. The data comes from the UCI datasets and can be downloaded from https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/

This data will be cleaned and used to train the built-in XGBoost algorithm on AWS SageMaker. An endpoint will then be created in AWS and used to make predictions from the wine's component values supplied as input.

- II)  PERFORM EXPLORATORY DATA ANALYSIS 

Inspect the data to validate the quality of the data downloaded from the UCI website. Analyse the distribution of missing values and outliers, and gain other relevant insights from the data.
- III) DO FEATURE ENGINEERING & SELECTION

Handle the missing values and outliers, and apply the transformations needed to make the data well suited to the machine learning model, making the most of the insights gained during the Exploratory Data Analysis phase.
- IV)  BUILD, TRAIN AND DEPLOY THE MODEL IN SAGEMAKER

The Boto3 package will be used to create the S3 bucket that stores the preprocessed dataset. SageMaker's built-in XGBoost algorithm will be built, trained, and deployed, using optimal hyperparameters to get the best results for the RMSE (Root Mean Squared Error). An endpoint will be created after the model is built.
This endpoint will be used to predict the quality of a wine when its composition is sent as input.

### V) BUILD, TRAIN AND DEPLOY THE MODEL IN SAGEMAKER
We will perform the following tasks in order to build, train, and deploy the model:
- a.) Import the necessary libraries and create the S3 bucket
- b.) Download the train and test data and store them in S3
- c.) Build and train the built-in XGBoost model
- d.) Retrieve the model artifacts for deployment
- e.) Create the model
- f.) Configure the serverless endpoint
- g.) Create the serverless endpoint
- h.) Test the predictions
- i.) Delete the endpoint
- j.) Conclusion

#### a.) Importing all the necessary libraries and creating S3 bucket

In [30]:
import sagemaker
import boto3
# get_image_uri and s3_input are deprecated in SageMaker SDK v2 and unused below,
# so only retrieve (image URI lookup) and Session are imported
from sagemaker.image_uris import retrieve
from sagemaker.session import Session

In [31]:
bucket_name = 'winequalityapp' # <--- Give this a unique name: S3 bucket names must be globally unique
my_region = boto3.session.Session().region_name # set the region of the instance
print(my_region)

us-east-1


In [32]:
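s3 = boto3.resource('s3')
try:
    if my_region == 'us-east-1':
        # us-east-1 is the default location and takes no LocationConstraint
        s3.create_bucket(Bucket=bucket_name)
    else:
        # every other region must be named explicitly
        s3.create_bucket(Bucket=bucket_name,
                         CreateBucketConfiguration={'LocationConstraint': my_region})
    print('S3 bucket created successfully')
except Exception as e:
    print('S3 error: ',e)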

S3 bucket created successfully


In [5]:
# set an output path where the trained model will be saved
prefix = 'xgboost-inbuilt-algo'
output_path ='s3://{}/{}/output'.format(bucket_name, prefix)
print(output_path)

s3://winequalityapp/xgboost-inbuilt-algo/output
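Because bucket creation fails on an invalid or already-taken name, it can help to check the name locally before calling AWS. The sketch below is only an illustration: `looks_like_valid_bucket_name` and `s3_uri` are hypothetical helpers, and the regular expression covers a simplified subset of the S3 naming rules (length, allowed characters, no consecutive dots), not the full specification.

```python
import re

# Simplified subset of the S3 naming rules (an assumption for illustration):
# 3-63 characters; lowercase letters, digits, hyphens and dots only;
# must start and end with a letter or digit; no consecutive dots.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name: str) -> bool:
    """Return True if `name` passes the simplified checks above."""
    return _BUCKET_RE.fullmatch(name) is not None and ".." not in name


def s3_uri(bucket: str, *parts: str) -> str:
    """Join a bucket and key parts into an s3:// URI using forward slashes."""
    key = "/".join(p.strip("/") for p in parts if p)
    return f"s3://{bucket}/{key}" if key else f"s3://{bucket}"


print(looks_like_valid_bucket_name("winequalityapp"))
print(s3_uri("winequalityapp", "xgboost-inbuilt-algo", "output"))
```

A name that passes this check can still be rejected by AWS if another account already owns it.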


#### b.) Download the train and test data and store in S3

In [6]:
import pandas as pd

import numpy as np
import urllib.request  # urlretrieve lives in the urllib.request submodule

pd.set_option("display.max_columns", None) #setting pandas to display all columns

In [7]:
#Importing the train dataset
try:
    urllib.request.urlretrieve ("https://raw.githubusercontent.com/Bandolo/winequality-studiolab/master/train_clean.csv", "train_clean.csv")
    print('Success: downloaded train_clean.csv.')
except Exception as e:
    print('Data load error: ',e)

try:
    train_clean = pd.read_csv('./train_clean.csv')
    print('Success: Data loaded into dataframe.')
except Exception as e:
    print('Data load error: ',e)

Success: downloaded train_clean.csv.
Success: Data loaded into dataframe.


In [8]:
#Importing the test dataset
try:
    urllib.request.urlretrieve ("https://raw.githubusercontent.com/Bandolo/winequality-studiolab/master/test_clean.csv", "test_clean.csv")
    print('Success: downloaded test_clean.csv.')
except Exception as e:
    print('Data load error: ',e)

try:
    test_clean = pd.read_csv('./test_clean.csv')
    print('Success: Data loaded into dataframe.')
except Exception as e:
    print('Data load error: ',e)

Success: downloaded test_clean.csv.
Success: Data loaded into dataframe.


In [9]:
print(test_clean.shape)

(320, 12)


In [10]:
print(train_clean.head())

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            9.9             0.540         0.45             2.3      0.071   
1           10.8             0.260         0.45             3.3      0.060   
2            9.9             0.350         0.55             2.1      0.062   
3            5.6             0.850         0.05             1.4      0.045   
4            6.6             0.725         0.09             5.5      0.117   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 16.0                  40.0  0.99910  3.39       0.62   
1                 20.0                  49.0  0.99720  3.13       0.54   
2                  5.0                  14.0  0.99710  3.26       0.79   
3                 12.0                  88.0  0.99240  3.56       0.82   
4                  9.0                  17.0  0.99655  3.35       0.49   

   alcohol  quality2  
0      9.4         1  
1      9.6         1  
2     10.6       

In [11]:
print(test_clean.head())

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0           10.8             0.470         0.43            2.10      0.171   
1            8.1             0.820         0.00            4.10      0.095   
2            9.1             0.290         0.33            2.05      0.063   
3           10.2             0.645         0.36            1.80      0.053   
4           12.2             0.450         0.49            1.40      0.075   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 27.0                  66.0  0.99820  3.17       0.76   
1                  5.0                  14.0  0.99854  3.36       0.53   
2                 13.0                  27.0  0.99516  3.26       0.84   
3                  5.0                  14.0  0.99820  3.17       0.42   
4                  3.0                   6.0  0.99690  3.13       0.63   

   alcohol  quality2  
0     10.8         2  
1      9.6         1  
2     11.7       

In [12]:
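# Save the train and test sets to the S3 bucket, starting with the train data.
# The label column (quality2) is moved to the first position and the file is written without a header row.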
import os
pd.concat([train_clean['quality2'], train_clean.drop(['quality2'], 
                                                axis=1)], 
                                                axis=1).to_csv('train.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
s3_input_train = sagemaker.TrainingInput(s3_data='s3://{}/{}/train'.format(bucket_name, prefix), content_type='csv')

In [13]:
# Test Data Into Buckets
pd.concat([test_clean['quality2'], test_clean.drop(['quality2'], 
                                              axis=1)], axis=1).to_csv('test.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'test/test.csv')).upload_file('test.csv')
s3_input_test = sagemaker.TrainingInput(s3_data='s3://{}/{}/test'.format(bucket_name, prefix), content_type='csv')
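The two cells above rearrange the columns so that the label comes first and write the file without a header row (the training log later reports `label_column=0`). The following sketch shows the same layout using only the standard library. `label_first_csv` is a hypothetical helper, and the sample uses just two of the eleven feature columns, with values taken from the first rows of `train_clean`.

```python
import csv
import io


def label_first_csv(rows, fieldnames, label):
    """Serialise dict rows as header-less CSV with `label` as the first column."""
    ordered = [label] + [name for name in fieldnames if name != label]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([row[name] for name in ordered])
    return buffer.getvalue()


# Two-feature sample for illustration; the real data has 11 feature columns.
sample = [
    {"alcohol": 9.4, "pH": 3.39, "quality2": 1},
    {"alcohol": 9.6, "pH": 3.13, "quality2": 1},
]
print(label_first_csv(sample, ["alcohol", "pH", "quality2"], "quality2"))
```

Each output line starts with the label, followed by the features in their original order.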

#### c.) Build and Train the Inbuilt XGBoost model

In [14]:
# Look up the URI of the built-in XGBoost container image for the current region.
# The third argument is the algorithm version; 'latest' is used here, but a specific version can be given instead.
container = retrieve('xgboost',boto3.Session().region_name,'latest')

In [15]:
# initialize hyperparameters
hyperparameters = {
        "max_depth":"5",
        "eta":"0.25",
        "gamma":"0.3",
        "min_child_weight":"7",
        "subsample":"1",
        "objective": "multi:softmax",
        "num_class": "10",
        "num_round":50
        }
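With `objective` set to `multi:softmax`, the model returns a single class index per row: the class whose softmax probability is highest. `num_class` is 10 here, so the labels in `quality2` must be integers from 0 to 9. The sketch below illustrates both ideas in plain Python; the raw scores are made up, and the helper names are hypothetical rather than part of XGBoost.

```python
import math


def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(scores):
    """Return the index of the most probable class."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)


def labels_fit_num_class(labels, num_class):
    """Check that every label is an integer in the range [0, num_class)."""
    return all(float(l).is_integer() and 0 <= l < num_class for l in labels)


scores = [0.1, 2.0, 0.3]  # hypothetical raw scores for three classes
print(predict_class(scores))
print(labels_fit_num_class([1, 2, 3], num_class=10))
```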

In [None]:
# sagemaker role
sagemaker_role = 'arn:aws:iam::609738416112:role/sagemaker-full-access-role'

In [16]:
# construct a SageMaker estimator that calls the xgboost-container
estimator = sagemaker.estimator.Estimator(image_uri=container, 
                                          hyperparameters=hyperparameters,
                                          # role=sagemaker.get_execution_role(),
                                          role = sagemaker_role,
                                          instance_count=1, 
                                          instance_type='ml.m5.2xlarge', 
                                          volume_size=5, # 5 GB 
                                          output_path=output_path,
                                          use_spot_instances=True,
                                          max_run=300,
                                          max_wait=600)

In [None]:
estimator.fit({'train': s3_input_train,'validation': s3_input_test})

2023-01-15 20:17:15 Starting - Starting the training job...
2023-01-15 20:17:42 Starting - Preparing the instances for training
ProfilerReport-1673813835: InProgress
......
2023-01-15 20:18:45 Downloading - Downloading input data...
2023-01-15 20:19:05 Training - Training image download completed. Training in progress..
Arguments: train
[2023-01-15:20:19:24:INFO] Running standalone xgboost training.
[2023-01-15:20:19:24:INFO] File size need to be processed in the node: 0.09mb. Available memory size in the node: 24004.79mb
[2023-01-15:20:19:24:INFO] Determined delimiter of CSV input is ','
[20:19:24] S3DistributionType set as FullyReplicated
[20:19:24] 1279x11 matrix with 14069 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2023-01-15:20:19:24:INFO] Determined delimiter of CSV input is ','
[20:19:24] S3DistributionType set as FullyReplicated
[20:19:24] 320x11 matrix with 3520 

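The training log reports the shape of each loaded matrix, which gives a quick sanity check on the upload: 1279 rows by 11 feature columns for the training set, with the label column already split off. A minimal sketch of parsing such a line follows. `parse_matrix_line` is a hypothetical helper, and reading `entries == rows * cols` as "no missing values" is an assumption about how the container counts entries.

```python
import re

MATRIX_RE = re.compile(r"(\d+)x(\d+) matrix with (\d+) entries")


def parse_matrix_line(line):
    """Extract (rows, cols, entries) from an XGBoost data-loading log line."""
    match = MATRIX_RE.search(line)
    if match is None:
        return None
    rows, cols, entries = (int(group) for group in match.groups())
    return rows, cols, entries


line = ("[20:19:24] 1279x11 matrix with 14069 entries loaded from "
        "/opt/ml/input/data/train?format=csv&label_column=0&delimiter=,")
rows, cols, entries = parse_matrix_line(line)
# If every cell is populated, entries should equal rows * cols
print(rows, cols, entries, rows * cols == entries)
```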
#### d.) Deployment of the model

In [21]:
# Setup clients
client = boto3.client(service_name="sagemaker")
runtime = boto3.client(service_name="sagemaker-runtime")

In [19]:
# Retrieve model data from training job
model_artifacts = estimator.model_data
model_artifacts

's3://winequalityapp/xgboost-inbuilt-algo/output/xgboost-2023-01-15-20-17-15-146/output/model.tar.gz'

#### e.) Model creation

In [24]:
from time import gmtime, strftime

model_name = "xgboost-serverless-inf" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Model name: " + model_name)

# example environment variables passed to the container (SOME_ENV_VAR is only a placeholder)
byo_container_env_vars = {"SAGEMAKER_CONTAINER_LOG_LEVEL": "20", "SOME_ENV_VAR": "myEnvVar"}

create_model_response = client.create_model(
    ModelName=model_name,
    Containers=[
        {
            "Image": container,
            "Mode": "SingleModel",
            "ModelDataUrl": model_artifacts,
            "Environment": byo_container_env_vars,
        }
    ],
    ExecutionRoleArn=sagemaker.get_execution_role(),
)

print("Model Arn: " + create_model_response["ModelArn"])

Model name: xgboost-serverless-inf2023-01-15-20-28-46
Model Arn: arn:aws:sagemaker:us-east-1:609738416112:model/xgboost-serverless-inf2023-01-15-20-28-46
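The model, endpoint configuration, and endpoint names in this notebook are all built by appending a UTC timestamp to a prefix, which keeps them unique across runs. A small helper can do this in one place and add a separating hyphen. This is a sketch only: `timestamped_name` is hypothetical, and the 63-character, alphanumerics-and-hyphens constraint is an assumption about SageMaker resource names that should be checked against the current API reference.

```python
import re
from time import gmtime, strftime

# Assumed naming constraint: at most 63 characters, letters, digits and
# hyphens only, and no hyphen at the start or end.
_NAME_RE = re.compile(r"[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?")
MAX_NAME_LENGTH = 63


def timestamped_name(prefix, when=None):
    """Return '<prefix>-<UTC timestamp>', validated against the assumed rules."""
    stamp = strftime("%Y-%m-%d-%H-%M-%S", when if when is not None else gmtime())
    name = f"{prefix}-{stamp}"
    if len(name) > MAX_NAME_LENGTH or _NAME_RE.fullmatch(name) is None:
        raise ValueError(f"invalid resource name: {name!r}")
    return name


print(timestamped_name("xgboost-serverless-inf"))
```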


#### f.) Configuration of Serverless Endpoint

In [25]:
xgboost_epc_name = "xgboost-serverless-epc" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

endpoint_config_response = client.create_endpoint_config(
    EndpointConfigName=xgboost_epc_name,
    ProductionVariants=[
        {
            "VariantName": "byoVariant",
            "ModelName": model_name,
            "ServerlessConfig": {
                "MemorySizeInMB": 4096,
                "MaxConcurrency": 1,
            },
        },
    ],
)

print("Endpoint Configuration Arn: " + endpoint_config_response["EndpointConfigArn"])

Endpoint Configuration Arn: arn:aws:sagemaker:us-east-1:609738416112:endpoint-config/xgboost-serverless-epc2023-01-15-20-34-37
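The `ServerlessConfig` above sets two values: the memory allocated to the endpoint and the maximum number of concurrent invocations. Validating them before the API call gives a clearer error than a rejected request. In the sketch below, the allowed memory sizes (1024 to 6144 MB in 1 GB steps) and the concurrency ceiling of 200 are assumptions based on the documented limits at the time of writing, and `serverless_config` is a hypothetical helper.

```python
# Assumed limits; check the current SageMaker documentation before relying on them.
ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)
MAX_CONCURRENCY_LIMIT = 200


def serverless_config(memory_mb, max_concurrency):
    """Build a ServerlessConfig dict after checking the assumed limits."""
    if memory_mb not in ALLOWED_MEMORY_MB:
        raise ValueError(f"MemorySizeInMB must be one of {ALLOWED_MEMORY_MB}")
    if not 1 <= max_concurrency <= MAX_CONCURRENCY_LIMIT:
        raise ValueError(f"MaxConcurrency must be between 1 and {MAX_CONCURRENCY_LIMIT}")
    return {"MemorySizeInMB": memory_mb, "MaxConcurrency": max_concurrency}


print(serverless_config(4096, 1))
```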


#### g.) Serverless Endpoint Creation

In [27]:
endpoint_name = "xgboost-serverless-ep" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

create_endpoint_response = client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=xgboost_epc_name,
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

Endpoint Arn: arn:aws:sagemaker:us-east-1:609738416112:endpoint/xgboost-serverless-ep2023-01-15-20-36-16


In [28]:
# wait for endpoint to reach a terminal state (InService) using describe endpoint
import time

describe_endpoint_response = client.describe_endpoint(EndpointName=endpoint_name)

while describe_endpoint_response["EndpointStatus"] == "Creating":
    describe_endpoint_response = client.describe_endpoint(EndpointName=endpoint_name)
    print(describe_endpoint_response["EndpointStatus"])
    time.sleep(15)

describe_endpoint_response

Creating
Creating
Creating
InService


{'EndpointName': 'xgboost-serverless-ep2023-01-15-20-36-16',
 'EndpointArn': 'arn:aws:sagemaker:us-east-1:609738416112:endpoint/xgboost-serverless-ep2023-01-15-20-36-16',
 'EndpointConfigName': 'xgboost-serverless-epc2023-01-15-20-34-37',
 'ProductionVariants': [{'VariantName': 'byoVariant',
   'DeployedImages': [{'SpecifiedImage': '811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest',
     'ResolvedImage': '811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost@sha256:89aca00873e8a1e52e0454934e24c94027da61fe4f3ad465982db2dccb9b09fe',
     'ResolutionTime': datetime.datetime(2023, 1, 15, 20, 36, 17, 694000, tzinfo=tzlocal())}],
   'CurrentWeight': 1.0,
   'DesiredWeight': 1.0,
   'CurrentInstanceCount': 0,
   'CurrentServerlessConfig': {'MemorySizeInMB': 4096, 'MaxConcurrency': 1}}],
 'EndpointStatus': 'InService',
 'CreationTime': datetime.datetime(2023, 1, 15, 20, 36, 17, 152000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2023, 1, 15, 20, 38, 5, 340000, tzinfo=tz
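The polling loop above runs until the status leaves `Creating`, with no upper bound on how long it waits. A variant with a timeout is sketched below. `wait_for_status` is a hypothetical helper that takes any zero-argument `describe` callable, so it can be tried without AWS access; here it is driven by a fake sequence of statuses. boto3 also provides an `endpoint_in_service` waiter on the SageMaker client, which may be a simpler option.

```python
import time


def wait_for_status(describe, pending=("Creating",), poll_seconds=15,
                    timeout_seconds=900, sleep=time.sleep):
    """Poll `describe()` until EndpointStatus leaves `pending` or time runs out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = describe()["EndpointStatus"]
        if status not in pending:
            return status  # e.g. 'InService' or 'Failed'
        if time.monotonic() >= deadline:
            raise TimeoutError(f"endpoint still {status!r} after {timeout_seconds}s")
        sleep(poll_seconds)


# Fake describe call for illustration: two 'Creating' responses, then 'InService'
statuses = iter(["Creating", "Creating", "InService"])
final = wait_for_status(lambda: {"EndpointStatus": next(statuses)},
                        sleep=lambda seconds: None)
print(final)
```

With the real client, `describe` would be `lambda: client.describe_endpoint(EndpointName=endpoint_name)`.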

#### h.) Test the predictions

In [None]:
from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer

# xgb_predictor is not defined earlier in this notebook, so attach a Predictor to the endpoint created above
xgb_predictor = Predictor(endpoint_name=endpoint_name, serializer=CSVSerializer())

test_data_array = test_clean.drop(['quality2'], axis=1).values # load the features into an array
predictions = xgb_predictor.predict(test_data_array).decode('utf-8') # predict!
# split on commas or newlines, since the separator can differ between container versions
predictions_array = np.array([float(p) for p in predictions.replace('\n', ',').split(',') if p.strip()])
print(predictions_array.shape)
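Once the predicted classes are available, they can be compared with the true `quality2` labels. The sketch below computes accuracy and per-pair counts in plain Python. The label and prediction values are hypothetical and are not results from this model, and `accuracy` and `confusion_counts` are illustrative helpers.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("no samples to score")
    hits = sum(int(t) == int(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count how often each (true, predicted) pair occurs."""
    return Counter((int(t), int(p)) for t, p in zip(y_true, y_pred))


y_true = [2, 1, 3, 1]          # hypothetical labels
y_pred = [2.0, 1.0, 3.0, 2.0]  # hypothetical predictions
print(accuracy(y_true, y_pred))
print(confusion_counts(y_true, y_pred))
```

In the notebook, `y_true` would be `test_clean['quality2']` and `y_pred` would be `predictions_array`.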

In [41]:
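# test.csv was saved without a header row, so read_csv treats the first data row as column names here
pd.read_csv('test.csv').head(2)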

Unnamed: 0,2,10.8,0.47,0.43,2.1,0.171,27.0,66.0,0.9982,3.17,0.76,10.8.1
0,1,8.1,0.82,0.0,4.1,0.095,5.0,14.0,0.99854,3.36,0.53,9.6
1,3,9.1,0.29,0.33,2.05,0.063,13.0,27.0,0.99516,3.26,0.84,11.7


In [42]:
body = b"9.1,0.29,0.33,2.05,0.063,13.0,27.0,0.99516,3.26,0.84,11.7"

In [43]:
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=body,
    ContentType="text/csv",
)

print(response["Body"].read())

b'3.0'
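The endpoint returns the prediction as raw bytes (`b'3.0'` for the single row sent above, matching the label 3 shown for that row in `test.csv`). A small parser turns this into numbers and also copes with several rows in one response. `parse_predictions` is a hypothetical helper, and accepting both commas and newlines as separators is an assumption meant to cover different container versions.

```python
def parse_predictions(body: bytes):
    """Decode an endpoint response body into a list of floats."""
    text = body.decode("utf-8")
    # Treat newlines like commas so both separators are accepted
    tokens = text.replace("\n", ",").split(",")
    return [float(token) for token in tokens if token.strip()]


print(parse_predictions(b"3.0"))
```

In the notebook this would be called as `parse_predictions(response["Body"].read())`; note that the body stream can only be read once.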


#### i.) Deleting the Endpoint

In [44]:
client.delete_model(ModelName=model_name)
client.delete_endpoint_config(EndpointConfigName=xgboost_epc_name)
client.delete_endpoint(EndpointName=endpoint_name)

{'ResponseMetadata': {'RequestId': '4c4ee64f-fb6f-491a-920a-839d132117d5',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '4c4ee64f-fb6f-491a-920a-839d132117d5',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Mon, 16 Jan 2023 11:55:15 GMT'},
  'RetryAttempts': 0}}