## Train, tune, and deploy a custom ML model using Flight Delay Prediction Algorithm from AWS Marketplace 


This solution predicts flight delays based on factors such as route, airport congestion, and airline efficiency, using a trainable ML model.



This sample notebook shows you how to train a custom ML model using Flight Delay Prediction from AWS Marketplace.

> **Note**: This is a reference notebook; it will not run until you make the changes suggested in the notebook.

#### Pre-requisites:
1. **Note**: This notebook contains elements that render correctly only in the Jupyter interface. Open this notebook from an Amazon SageMaker Notebook Instance or Amazon SageMaker Studio.
1. Ensure that the IAM role used has the **AmazonSageMakerFullAccess** policy attached.
1. Some hands-on experience using [Amazon SageMaker](https://aws.amazon.com/sagemaker/).
1. To use this algorithm successfully, ensure that:
    1. Either your IAM role has these three permissions and you have authority to make AWS Marketplace subscriptions in the AWS account used: 
        1. **aws-marketplace:ViewSubscriptions**
        1. **aws-marketplace:Unsubscribe**
        1. **aws-marketplace:Subscribe**  
    2. or your AWS account has a subscription to Flight Delay Prediction.

#### Contents:
1. [Subscribe to the algorithm](#1.-Subscribe-to-the-algorithm)
1. [Prepare dataset](#2.-Prepare-dataset)
	1. [Dataset format expected by the algorithm](#A.-Dataset-format-expected-by-the-algorithm)
	1. [Configure and visualize train and test dataset](#B.-Configure-and-visualize-train-and-test-dataset)
	1. [Upload datasets to Amazon S3](#C.-Upload-datasets-to-Amazon-S3)
1. [Train a machine learning model](#3:-Train-a-machine-learning-model)
	1. [Set up environment](#3.1-Set-up-environment)
	1. [Train a model](#3.2-Train-a-model)
1. [Deploy model and verify results](#4:-Deploy-model-and-verify-results)
    1. [Deploy trained model](#A.-Deploy-trained-model)
    1. [Create input payload](#B.-Create-input-payload)
    1. [Perform real-time inference](#C.-Perform-real-time-inference)
    1. [Visualize output](#D.-Visualize-output)
    1. [Delete the endpoint](#F.-Delete-the-endpoint)
1. [Perform Batch inference](#5.-Perform-Batch-inference)
	1. [Inspect the Batch Transform Output in S3](#A.-Inspect-the-Batch-Transform-Output-in-S3)
1. [Clean-up](#6.-Clean-up)
	1. [Delete the model](#A.-Delete-the-model)
	1. [Unsubscribe to the listing (optional)](#B.-Unsubscribe-to-the-listing-(optional))


#### Usage instructions
You can run this notebook one cell at a time (press Shift+Enter to run a cell).

### 1. Subscribe to the algorithm

To subscribe to the algorithm:
1. Open the algorithm listing page for Flight Delay Prediction.
1. On the AWS Marketplace listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the EULA, pricing, and support terms, and click **"Accept Offer"** if you agree with them.
1. Once you click the **Continue to configuration** button and choose a **region**, you will see a **Product Arn**. This is the algorithm ARN that you need to specify when training a custom ML model. Copy the ARN corresponding to your region and specify it in the following cell.

In [1]:
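# Placeholder: replace with the Product ARN copied from the AWS Marketplace listing for your region.
algo_arn = 'flight-delay-prediction'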
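The value in the cell above is a placeholder, not an ARN. As a minimal sketch, the check below tests whether a string has the general shape of a SageMaker algorithm ARN before you start a training job. The helper name `looks_like_algorithm_arn`, the regular expression, and the example account ID are assumptions made for illustration; the ARN shown on your listing page is the authoritative value.

```python
import re

# Assumed general shape of a SageMaker algorithm ARN:
# arn:<partition>:sagemaker:<region>:<12-digit account id>:algorithm/<name>
ALGORITHM_ARN_PATTERN = re.compile(
    r"^arn:aws[a-z-]*:sagemaker:[a-z0-9-]+:\d{12}:algorithm/[A-Za-z0-9-]+$"
)


def looks_like_algorithm_arn(value: str) -> bool:
    """Return True when `value` has the shape of a SageMaker algorithm ARN."""
    return bool(ALGORITHM_ARN_PATTERN.match(value))


# Hypothetical ARN with a made-up account ID, for illustration only.
example_arn = (
    "arn:aws:sagemaker:us-east-2:123456789012:algorithm/flight-delay-prediction"
)

print(looks_like_algorithm_arn(example_arn))                # True
print(looks_like_algorithm_arn("flight-delay-prediction"))  # False
```

A shape check like this only catches copy-and-paste mistakes; it does not confirm that your account is subscribed to the algorithm.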

### 2. Prepare dataset

In [2]:
import base64
import json 
import uuid
from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role
from urllib.parse import urlparse
import boto3
import urllib.request
import numpy as np
from zipfile import ZipFile
import pandas as pd

#### A. Dataset format expected by the algorithm

The algorithm expects data in the following format for best results:
* The inputs must be provided as CSV files with the mandatory information in columns.
* The input data files must contain all columns specified in the input data description; other columns will be ignored.
* The input data files must contain the column 'DEPARTURE_DELAY', holding the total delay on departure in minutes.
* The training data file should be named train.csv.
* The test data file should be named test.csv.
* For detailed instructions, please refer to the sample notebook and the algorithm input details.
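The column requirements above can be checked before uploading anything. The sketch below reads only the header row of a CSV and reports which required columns are missing. The helper name `missing_columns` is hypothetical, and the default requirement covers only 'DEPARTURE_DELAY', the one column named explicitly above; extend `required` with the columns from the algorithm's input data description.

```python
import csv
import io


def missing_columns(csv_text, required=("DEPARTURE_DELAY",)):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return sorted(set(required) - present)


# Small in-memory sample; in practice, pass the text of train.csv or test.csv.
sample = (
    "DATE,SCHEDULED_DEPARTURE,AIRLINE,DEPARTURE_DELAY\n"
    "05/05/2015,07:00,AA,-5.0\n"
)

print(missing_columns(sample))
print(missing_columns(sample, required=("DEPARTURE_DELAY", "ORIGIN_AIRPORT")))
```

Returning the missing names, rather than a boolean, makes it easier to see what to fix in the file.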

#### B. Configure and visualize train and test dataset

In [3]:
training_dataset='Training Inputs/training/train.csv'

In [4]:
test_dataset='Training Inputs/test/test.csv'

In [5]:
df = pd.read_csv(training_dataset)
df.head()

Unnamed: 0,DATE,SCHEDULED_DEPARTURE,ORIGIN_AIRPORT,DESTINATION_AIRPORT,AIRLINE,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,DEPARTURE_DELAY
0,05/05/2015,07:00,LAX,BOS,AA,,,-5.0
1,30/07/2015,13:40,ATL,IND,WN,,,0.0
2,07/04/2015,10:53,CAE,DFW,EV,0.0,196.0,204.0
3,12/11/2015,17:00,MSP,ATL,WN,,,21.0
4,06/02/2015,06:55,LGA,STL,WN,,,-8.0
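The preview suggests that DATE is written day-first: the value 30/07/2015 cannot be a month-first date. The sketch below combines DATE and SCHEDULED_DEPARTURE into one timestamp under that assumption. The helper name `parse_scheduled_departure` is hypothetical, and how the algorithm itself parses these columns is not stated in this notebook.

```python
from datetime import datetime


def parse_scheduled_departure(date_text, time_text):
    """Combine a day-first DATE and an HH:MM SCHEDULED_DEPARTURE into a datetime."""
    # Assumption: DD/MM/YYYY, inferred from values such as 30/07/2015 in the preview.
    return datetime.strptime(f"{date_text} {time_text}", "%d/%m/%Y %H:%M")


departure = parse_scheduled_departure("30/07/2015", "13:40")
print(departure.isoformat())  # 2015-07-30T13:40:00
```

Being explicit about the format matters for ambiguous rows such as 07/04/2015, which reads as 7 April under this assumption.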


#### C. Upload datasets to Amazon S3

In [6]:
sagemaker_session = sage.Session()
bucket=sagemaker_session.default_bucket()

In [8]:
# training input location
common_prefix = "flight-delays"
training_input_prefix = common_prefix + "/training-input-data"
TRAINING_WORKDIR = "Training Inputs/training"
training_input = sagemaker_session.upload_data(TRAINING_WORKDIR, key_prefix=training_input_prefix)
print("Training input uploaded to " + training_input)

Training input uploaded to s3://sagemaker-us-east-2-786796469737/flight-delays/training-input-data


In [9]:
# test input location
test_input_prefix = common_prefix + "/test-input-data"
TEST_WORKDIR = "Training Inputs/test"
test_input = sagemaker_session.upload_data(TEST_WORKDIR, key_prefix=test_input_prefix)
print("Test input uploaded to " + test_input)

Test input uploaded to s3://sagemaker-us-east-2-786796469737/flight-delays/test-input-data


## 3: Train a machine learning model

Now that the datasets are available in an accessible Amazon S3 bucket, we are ready to train a machine learning model.

### 3.1 Set up environment

In [10]:
role = get_execution_role()

In [11]:
output_location = 's3://{}/flight_delays/{}'.format(bucket, 'output')

### 3.2 Train a model

For information on creating an `Estimator` object, see the [documentation](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html).

In [12]:
# Create an estimator object for running a training job
estimator = sage.algorithm.AlgorithmEstimator(
    algorithm_arn=algo_arn,
    base_job_name="flight-delays-training",
    role=role,
    # SageMaker Python SDK v2 uses instance_count/instance_type in place of
    # the older train_instance_count/train_instance_type arguments.
    instance_count=1,
    instance_type='ml.m5.large',
    input_mode="File",
    output_path=output_location,
    sagemaker_session=sagemaker_session,
)
# Run the training job; each channel name maps to an uploaded S3 prefix.
estimator.fit({"training": training_input, "test": test_input})

2021-11-15 06:44:17 Starting - Starting the training job...
2021-11-15 06:44:40 Starting - Launching requested ML instances
ProfilerReport-1636958657: InProgress
...
2021-11-15 06:45:09 Starting - Preparing the instances for training.........
2021-11-15 06:46:40 Downloading - Downloading input data
2021-11-15 06:46:40 Training - Downloading the training image...
2021-11-15 06:47:12 Uploading - Uploading generated training model.
Starting the training.
Starting Preprocessing
Preprocessing done
Starting Classification Training
Classification Training done
Training Performance:
Classification Report:
              precision    recall  f1-score   support
         0.0       0.79      0.99      0.88        77
         1.0       0.75      0.13      0.22        23
    accuracy                           0.79       100
   macro avg       0.77      0.56      0.55       100
weighted avg       0.78      0.79      0.73       100


See this [blog-post](https://aws.amazon.com/blogs/machine-learning/easily-monitor-and-visualize-metrics-while-training-models-on-amazon-sagemaker/) for more information on how to visualize metrics during training. You can also open the training job from the [Amazon SageMaker console](https://console.aws.amazon.com/sagemaker/home?#/jobs/) and monitor the metrics and logs in the **Monitor** section.
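To read the classification report in the training log, it helps to see how precision, recall, and F1 follow from confusion-matrix counts. The sketch below uses counts that are an assumption: they were inferred so that they reproduce the rounded figures in the report (supports of 77 and 23), and they do not come from the training job itself. The log also does not say what the labels 0.0 and 1.0 mean, so the code refers to them only by label.

```python
def class_metrics(tp, fp, fn):
    """Return (precision, recall, f1) for one class from its confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Assumed counts, treating label 1.0 as the positive class.
# Inferred from the rounded report; not taken from the training job.
tn, fp, fn, tp = 76, 1, 20, 3

class_1 = class_metrics(tp=tp, fp=fp, fn=fn)
# For label 0.0 the roles swap: its "true positives" are the true negatives above.
class_0 = class_metrics(tp=tn, fp=fn, fn=fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print([round(m, 2) for m in class_0])  # [0.79, 0.99, 0.88]
print([round(m, 2) for m in class_1])  # [0.75, 0.13, 0.22]
print(round(accuracy, 2))              # 0.79
```

Under these assumed counts, the low recall for label 1.0 means that only 3 of the 23 rows with that label were identified, even though overall accuracy is 0.79.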

### 4: Deploy model and verify results

Now you can deploy the model for performing real-time inference.

In [13]:
model_name='flight-delays'

content_type='application/zip'

real_time_inference_instance_type='ml.m5.large'
batch_transform_inference_instance_type='ml.m5.large'

#### A. Deploy trained model

In [14]:
# Deploy the trained model to a real-time endpoint
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type=real_time_inference_instance_type,
)

..........
---!

Once the endpoint is created, you can perform real-time inference.

#### B. Create input payload

In [25]:
file_name = '"Model Input"/inference.zip'
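The endpoint is invoked with an `application/zip` payload. As a minimal sketch, the helpers below build such an archive in memory from CSV text and read the first member back, mirroring how this notebook later opens the output archive. The helper names and the inner file name `inference.csv` are assumptions; the algorithm listing defines the layout the model actually expects.

```python
import io
import zipfile


def build_zip_payload(csv_text, member_name="inference.csv"):
    """Return the bytes of a zip archive holding one CSV member."""
    # Assumption: a single CSV inside the archive; the member name is hypothetical.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(member_name, csv_text)
    return buffer.getvalue()


def first_member_text(zip_bytes):
    """Return the decoded text of the first member of a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        first = archive.namelist()[0]
        return archive.read(first).decode("utf-8")


csv_text = (
    "DATE,SCHEDULED_DEPARTURE,ORIGIN_AIRPORT,DESTINATION_AIRPORT,AIRLINE\n"
    "18/04/2015,20:45,PHL,MCO,US\n"
)
payload = build_zip_payload(csv_text)
print(first_member_text(payload) == csv_text)  # True
```

Writing `payload` to disk in binary mode gives a file that can be passed to the CLI call with `fileb://`.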

#### C. Perform real-time inference

In [31]:
output_file_name = '"Model Output"/output.zip'

In [32]:
!aws sagemaker-runtime invoke-endpoint \
    --endpoint-name $predictor.endpoint_name \
    --body fileb://$file_name \
    --content-type $content_type \
    --region $sagemaker_session.boto_region_name \
    $output_file_name

{
    "ContentType": "application/zip",
    "InvokedProductionVariant": "AllTraffic"
}
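The same call can be made from Python instead of the AWS CLI. The sketch below wraps the `invoke_endpoint` operation of the boto3 `sagemaker-runtime` client; the wrapper name `invoke_zip_endpoint` is hypothetical. The client is passed in as an argument so the function can be tried with a stand-in object, without AWS credentials.

```python
def invoke_zip_endpoint(runtime_client, endpoint_name, zip_bytes):
    """Send a zip payload to a SageMaker endpoint and return the raw response bytes."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/zip",
        Body=zip_bytes,
    )
    # The response body is a stream; read it fully to obtain the returned archive.
    return response["Body"].read()


# Usage against a live endpoint (requires AWS credentials and boto3):
#   import boto3
#   runtime = boto3.client("sagemaker-runtime")
#   with open("Model Input/inference.zip", "rb") as f:
#       result = invoke_zip_endpoint(runtime, predictor.endpoint_name, f.read())
#   with open("Model Output/output.zip", "wb") as f:
#       f.write(result)
```

Unlike the CLI call, this version keeps the response in memory, so you choose where to write the output archive.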


#### D. Visualize output

In [33]:
# Open the archive written by the invoke-endpoint call above
with ZipFile("Model Output/output.zip", "r") as output_zip:
    filename = output_zip.namelist()[0]
    output = pd.read_csv(output_zip.open(filename), low_memory=False)
output.head(10)

Unnamed: 0,DATE,SCHEDULED_DEPARTURE,ORIGIN_AIRPORT,DESTINATION_AIRPORT,AIRLINE,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,Classifier Prediction,Regressor Prediction
0,18/04/2015,20:45,PHL,MCO,US,,,0.0,14.203183
1,01/05/2015,12:10,EWR,MDW,WN,,,0.0,9.476961
2,16/12/2015,18:00,ORD,LGA,UA,,,0.0,22.118745
3,13/10/2015,09:17,15919,11292,OO,,,0.0,26.177241
4,24/11/2015,08:36,PHX,SFO,UA,0.0,0.0,0.0,26.177241
5,14/10/2015,15:47,14771,11292,UA,,,0.0,26.177241
6,14/01/2015,12:20,DFW,MCO,AA,,,0.0,6.329906
7,02/01/2015,06:00,SMF,PHX,US,,,0.0,12.825333
8,26/03/2015,10:15,MSP,BWI,DL,,,0.0,-6.462379
9,08/08/2015,17:10,MCO,MCI,WN,,,0.0,17.926988


#### F. Delete the endpoint

Now that you have successfully performed a real-time inference, you no longer need the endpoint. You can delete it to avoid being charged.

In [34]:
predictor.delete_endpoint(delete_endpoint_config=True)

Since this is an experiment, you do not need to run a hyperparameter tuning job. However, if you would like to tune a model trained using a third-party algorithm, you can use Amazon SageMaker's hyperparameter tuning functionality.

### 5. Perform Batch inference

In this section, you will perform batch inference using multiple input payloads together.

In [35]:
#upload the batch-transform job input files to S3
transform_input_folder = "Model Input/inference.zip"
transform_input = sagemaker_session.upload_data(transform_input_folder, key_prefix=model_name) 
print("Transform input uploaded to " + transform_input)

Transform input uploaded to s3://sagemaker-us-east-2-786796469737/flight-delays/inference.zip


In [36]:
#Run the batch-transform job
transformer = estimator.transformer(1, batch_transform_inference_instance_type)
transformer.transform(transform_input, content_type=content_type)
transformer.wait()

..........
.....................
 * Serving Flask app 'serve' (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on all addresses.
 * Running on http://169.254.255.131:8080/ (Press CTRL+C to quit)
169.254.255.130 - - [15/Nov/2021 07:04:12] "GET /ping HTTP/1.1" 200 -
169.254.255.130 - - [15/Nov/2021 07:04:12] "GET /execution-parameters HTTP/1.1" 404 -
Preprocessing done
Starting Classification
Classification done
Starting Regression
Regression done
169.254.255.130 - - [15/Nov/2021 07:04:12] "POST /invocations HTTP/1.1" 200 -
2021-11-15T07:04:12.814:[sagemaker logs]: MaxConcurrentTransforms=1, MaxPayloadInMB=6, BatchStrategy=MULTI_RECORD
 * Serving Flask app 'serve' (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on all addresses.
 * Running on http://169

In [37]:
#output is available on following path
transformer.output_path

's3://sagemaker-us-east-2-786796469737/flight-delays-training-2021-11-15-07-00-43-832'

#### A. Inspect the Batch Transform Output in S3

In [38]:
from urllib.parse import urlparse

# Locate the batch transform output object: the input file name with ".out" appended
parsed_url = urlparse(transformer.output_path)
bucket_name = parsed_url.netloc
file_key = '{}/{}.out'.format(parsed_url.path[1:], "inference.zip")

s3_client = sagemaker_session.boto_session.client('s3')

# Read from the bucket named in the output path rather than assuming the default bucket
response = s3_client.get_object(Bucket=bucket_name, Key=file_key)

In [39]:
# Key prefix of the transform output: everything after the bucket name in the S3 URI
bucketFolder = urlparse(transformer.output_path).path.lstrip('/')

In [40]:
import boto3
s3_conn = boto3.client("s3")
bucket_name=bucket
with open('output.zip', 'wb') as f:
    s3_conn.download_fileobj(bucket_name, bucketFolder+'/' + "inference.zip" +'.out', f)
    print("Output file loaded from bucket")

Output file loaded from bucket
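The cells above rely on the fact that batch transform stores each result under the job's output path, using the input file name with `.out` appended. The sketch below derives the bucket and object key from an S3 URI in one place, using the output path printed earlier in this notebook. The helper name `batch_output_location` is hypothetical.

```python
from urllib.parse import urlparse


def batch_output_location(output_path, input_file_name):
    """Return (bucket, key) of the batch transform result for one input file."""
    parsed = urlparse(output_path)
    prefix = parsed.path.strip("/")
    # The result keeps the input file name and gains a ".out" suffix.
    result_name = f"{input_file_name}.out"
    key = f"{prefix}/{result_name}" if prefix else result_name
    return parsed.netloc, key


bucket_name, key = batch_output_location(
    "s3://sagemaker-us-east-2-786796469737/flight-delays-training-2021-11-15-07-00-43-832",
    "inference.zip",
)
print(bucket_name)  # sagemaker-us-east-2-786796469737
print(key)          # flight-delays-training-2021-11-15-07-00-43-832/inference.zip.out
```

Parsing the URI, instead of indexing into a split string, keeps working when the output path contains more than one prefix level.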


In [41]:
with ZipFile("output.zip", "r") as output_zip:
    filename = output_zip.namelist()[0]
    output = pd.read_csv(output_zip.open(filename), low_memory=False)

In [42]:
output.head(10)

Unnamed: 0,DATE,SCHEDULED_DEPARTURE,ORIGIN_AIRPORT,DESTINATION_AIRPORT,AIRLINE,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,Classifier Prediction,Regressor Prediction
0,18/04/2015,20:45,PHL,MCO,US,,,0.0,14.203183
1,01/05/2015,12:10,EWR,MDW,WN,,,0.0,9.476961
2,16/12/2015,18:00,ORD,LGA,UA,,,0.0,22.118745
3,13/10/2015,09:17,15919,11292,OO,,,0.0,26.177241
4,24/11/2015,08:36,PHX,SFO,UA,0.0,0.0,0.0,26.177241
5,14/10/2015,15:47,14771,11292,UA,,,0.0,26.177241
6,14/01/2015,12:20,DFW,MCO,AA,,,0.0,6.329906
7,02/01/2015,06:00,SMF,PHX,US,,,0.0,12.825333
8,26/03/2015,10:15,MSP,BWI,DL,,,0.0,-6.462379
9,08/08/2015,17:10,MCO,MCI,WN,,,0.0,17.926988


### 6. Clean-up

#### A. Delete the model

In [43]:
transformer.delete_model()

#### B. Unsubscribe to the listing (optional)

If you would like to unsubscribe from the algorithm, follow these steps. Before you cancel the subscription, ensure that you do not have any [deployable model](https://console.aws.amazon.com/sagemaker/home#/models) created from the model package or using the algorithm. Note: you can find this information by looking at the container name associated with the model.

**Steps to unsubscribe from the product on AWS Marketplace**:
1. Navigate to the __Machine Learning__ tab on the [__Your Software subscriptions page__](https://aws.amazon.com/marketplace/ai/library?productType=ml&ref_=mlmp_gitdemo_indust).
2. Locate the listing whose subscription you want to cancel, and then choose __Cancel Subscription__.

