### Unit: Data Science Technology and Systems PG (11523), Semester 2 2021
### Assignment: Final Project

### Student: Saud Alshammari
### UNI-No: U3197222

___________________________________________________________________________________________

# Problem: Predicting Airplane Delays

The goals of this notebook are:
- Process and create a dataset from downloaded ZIP files
- Exploratory data analysis (EDA)
- Establish a baseline model and improve it

## Introduction to business scenario
You work for a travel booking website that is working to improve the customer experience for flights that were delayed. The company wants to create a feature to let customers know if the flight will be delayed due to weather when the customers are booking the flight to or from the busiest airports for domestic travel in the US. 

You are tasked with solving part of this problem by using machine learning to identify whether a flight will be delayed due to weather. You have been given access to a dataset of on-time performance of domestic flights operated by large air carriers. You can use this data to train a machine learning model to predict whether a flight to or from the busiest airports is going to be delayed.

### Dataset
The provided dataset contains scheduled and actual departure and arrival times reported by certified US air carriers that account for at least 1 percent of domestic scheduled passenger revenues. The data was collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). The dataset contains the date, time, origin, destination, airline, distance, and delay status of flights between 2014 and 2018.
The data are stored in 60 compressed files, one per month for the five years from 2014 to 2018, and each file contains a CSV with the flight details for that month. The data can be downloaded from this link: [https://ucstaff-my.sharepoint.com/:f:/g/personal/ibrahim_radwan_canberra_edu_au/Er0nVreXmihEmtMz5qC5kVIB81-ugSusExPYdcyQTglfLg?e=bNO312]. Please download the data files and place them at a path relative to this notebook. The dataset(s) used in this assignment were compiled by the Office of Airline Information, Bureau of Transportation Statistics (BTS), Airline On-Time Performance Data, available at the following link: [https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ]. 

# Step 1: Prepare the environment 

Use one of the Amazon SageMaker labs that we have practised with and perform the following steps:
1. Start a lab.
2. Create a notebook instance and name it "oncloudproject".
3. Increase the used memory to 25 GB from the additional configurations.
4. Open Jupyter Lab and upload this notebook into it.
5. Upload the two combined CSV files (combined_csv_v1.csv and combined_csv_v2.csv), which you created in Part A of this project.

# Step 2: Build and evaluate simple models

Write code to perform the following steps:
1. Split data into training, validation and testing sets (70% - 15% - 15%).
2. Use the linear learner estimator to build a classification model.
3. Host the model on another instance.
4. Perform batch transform to evaluate the model on testing data.
5. Report the performance metrics that you think best assess the model's performance.

Note: You are required to perform the above steps on the two combined datasets separately and to comment on the differences.
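
Step 5 asks for performance metrics, so the following is a minimal sketch of how a confusion matrix and the usual derived metrics could be computed from true and predicted 0/1 labels. The function name `binary_classification_report` and the toy labels are hypothetical and are not results from the flight data. If delayed flights turn out to be a minority class, precision, recall and F1 are likely to be more informative than accuracy alone.

```python
from collections import Counter


def binary_classification_report(y_true, y_pred):
    """Return confusion-matrix counts and common metrics for 0/1 labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count each (true label, predicted label) pair once.
    counts = Counter(zip(y_true, y_pred))
    tp = counts[(1, 1)]  # delayed, predicted delayed
    tn = counts[(0, 0)]  # on time, predicted on time
    fp = counts[(0, 1)]  # on time, predicted delayed
    fn = counts[(1, 0)]  # delayed, predicted on time

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels for illustration only (hypothetical, not model output).
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]
report = binary_classification_report(y_true, y_pred)
print(report)
```

The same function can be applied to each dataset and each model in turn, which gives directly comparable numbers for the written comments.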

In [3]:
import pandas as pd
import numpy as np
import sagemaker
import boto3
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.estimator import Estimator
from time import strftime, gmtime
from sagemaker.predictor import csv_serializer, csv_deserializer

In [4]:
#dataset = pd.read_csv('C:/Users/Sbals/Documents/Semester4/DataScienceTechnologyandSystemsPG11523/combined_csv_v1.csv')
dataset_v1 = pd.read_csv('combined_csv_v1.csv')
dataset_v2 = pd.read_csv('combined_csv_v2.csv')

print(dataset_v1.shape)
print(dataset_v2.shape)
print(" ")
print(" ")


# Split the data
training_dataset_v1, validation_dataset_v1, testing_dataset_v1 = np.split(dataset_v1.sample(frac=1), [int(.7*len(dataset_v1)), int(.85*len(dataset_v1))])
training_dataset_v2, validation_dataset_v2, testing_dataset_v2 = np.split(dataset_v2.sample(frac=1), [int(.7*len(dataset_v2)), int(.85*len(dataset_v2))])

training_dataset_v1 = training_dataset_v1.dropna(axis=0)
validation_dataset_v1 = validation_dataset_v1.dropna(axis=0)
testing_dataset_v1 = testing_dataset_v1.dropna(axis=0)

training_dataset_v2 = training_dataset_v2.dropna(axis=0)
validation_dataset_v2 = validation_dataset_v2.dropna(axis=0)
testing_dataset_v2 = testing_dataset_v2.dropna(axis=0)

print("dataset_v1")
print(training_dataset_v1.shape)
print(validation_dataset_v1.shape)
print(testing_dataset_v1.shape)
print(" ")
print("dataset_v2")
print(training_dataset_v2.shape)
print(validation_dataset_v2.shape)
print(testing_dataset_v2.shape)


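# Save the split datasets (no index or header row in the CSV files)
training_dataset_v1.to_csv('training_dataset_v1.csv', index=False, header=False)
validation_dataset_v1.to_csv('validation_dataset_v1.csv', index=False, header=False)
# Keep the test labels before dropping the target column so predictions can be scored later
testing_labels_v1 = testing_dataset_v1['target'].values
testing_dataset_v1 = testing_dataset_v1.drop('target', axis=1).values

training_dataset_v2.to_csv('training_dataset_v2.csv', index=False, header=False)
validation_dataset_v2.to_csv('validation_dataset_v2.csv', index=False, header=False)
# Keep the test labels before dropping the target column so predictions can be scored later
testing_labels_v2 = testing_dataset_v2['target'].values
testing_dataset_v2 = testing_dataset_v2.drop('target', axis=1).values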

####################################################################################################

sess = sagemaker.Session()
bucket = sess.default_bucket()

V1_prefix = 'V1'
training_data_v1_path = sess.upload_data(path='training_dataset_v1.csv', key_prefix=V1_prefix + '/input/training_v1')
validation_data_v1_path = sess.upload_data(path='validation_dataset_v1.csv', key_prefix=V1_prefix + '/input/validation_v1')

V2_prefix = 'V2'
training_data_v2_path = sess.upload_data(path='training_dataset_v2.csv', key_prefix=V2_prefix + '/input/training_v2')
validation_data_v2_path = sess.upload_data(path='validation_dataset_v2.csv', key_prefix=V2_prefix + '/input/validation_v2')

print(" ")
print(training_data_v1_path)
print(validation_data_v1_path)
print(training_data_v2_path)
print(validation_data_v2_path)

(53982, 75)
(38156, 86)
 
 
dataset_v1
(37787, 75)
(8096, 75)
(8098, 75)
 
dataset_v2
(26709, 86)
(5722, 86)
(5724, 86)
 
s3://sagemaker-us-east-1-749577378176/V1/input/training_v1/training_dataset_v1.csv
s3://sagemaker-us-east-1-749577378176/V1/input/validation_v1/validation_dataset_v1.csv
s3://sagemaker-us-east-1-749577378176/V2/input/training_v2/training_dataset_v2.csv
s3://sagemaker-us-east-1-749577378176/V2/input/validation_v2/validation_dataset_v2.csv
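
The split in the cell above shuffles without a fixed seed, so each run produces different partitions, and it drops rows with missing values after splitting, which is why the validation sets are one row short of an exact 15%. The following is a minimal sketch of a reproducible 70/15/15 split over row positions. The helper name `split_indices`, the integer percentages and the seed value are assumptions made for illustration.

```python
import random


def split_indices(n_rows, train_pct=70, val_pct=15, seed=42):
    """Shuffle row positions once and cut them into train/validation/test."""
    indices = list(range(n_rows))
    # A dedicated Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(indices)

    # Integer arithmetic avoids floating-point surprises at the cut points.
    train_end = n_rows * train_pct // 100
    val_end = n_rows * (train_pct + val_pct) // 100
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]


train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # prints: 700 150 150
```

With pandas, the positions could be applied with `.iloc` to a DataFrame that has already had its incomplete rows removed, for example `clean = dataset_v1.dropna(axis=0)` followed by `clean.iloc[train_idx]`, so that the three partitions keep the intended proportions.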


In [6]:
# simple model on combined_csv_v1

V1_region = boto3.Session().region_name    
V1_container = get_image_uri(V1_region, 'linear-learner')
V1_role = sagemaker.get_execution_role() 

V1_model_1 = Estimator(V1_container,
    role= V1_role, 
    train_instance_count=1,
    train_instance_type='ml.m4.4xlarge',
    output_path='s3://{}/{}/output'.format(bucket, V1_prefix)
)

V1_model_1.set_hyperparameters(predictor_type='binary_classifier', mini_batch_size=1000)

training_data_v1_channel   = sagemaker.s3_input(s3_data=training_data_v1_path, content_type='text/csv')
validation_data_v1_channel = sagemaker.s3_input(s3_data=validation_data_v1_path, content_type='text/csv')

V1_ll_data = {'train': training_data_v1_channel, 'validation': validation_data_v1_channel}

# Fit the models
V1_model_1.fit(V1_ll_data)

# List the model artifacts written to the output path
!aws s3 ls --recursive {V1_model_1.output_path}

##################################

timestamp = strftime('%d-%H-%M-%S', gmtime())

endpoint_name = 'linear-learner-demo-'+timestamp
print(endpoint_name)

V1_model_1_predictor = V1_model_1.deploy(endpoint_name=endpoint_name, 
                        initial_instance_count=1, 
                        instance_type='ml.m4.xlarge')


V1_model_1_predictor.content_type = 'text/csv'
V1_model_1_predictor.serializer = csv_serializer
V1_model_1_predictor.deserializer = csv_deserializer


v1_response = V1_model_1_predictor.predict(testing_dataset_v1)

print(" ")
print(v1_response)

2021-11-15 13:16:04 Starting - Starting the training job...
2021-11-15 13:16:05 Starting - Launching requested ML instances......
2021-11-15 13:17:20 Starting - Preparing the instances for training.........
2021-11-15 13:18:43 Downloading - Downloading input data...
2021-11-15 13:19:31 Training - Training image download completed. Training in progress..
Docker entrypoint called with argument(s): train
Running default environment configuration script
[11/15/2021 13:19:36 INFO 139729209104192] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/resources/default-input.json: {'mini_batch_size': '1000', 'epochs': '15', 'feature_dim': 'auto', 'use_bias': 'true', 'binary_classifier_model_selection_criteria': 'accuracy', 'f_beta': '1.0', 'target_recall': '0.8', 'target_precision': '0.8', 'num_models': 'auto', 'num_calibration_samples': '10000000', 'init_method': 'uniform', 'init_scale': '0.07', 'init_sigma': '0.01', 'init_bias': '0.0', 'o
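
The cell above sends the whole test array to the endpoint in a single `predict` call. Real-time endpoints cap the size of one request, so a larger test set may need to be sent in slices. The following is a minimal sketch of that idea; `predict_in_batches`, `fake_predict` and the batch size are hypothetical, and the stand-in predictor only mimics the shape of a response made of rows of strings.

```python
def predict_in_batches(predict_fn, rows, batch_size=500):
    """Call predict_fn on consecutive slices of rows and join the results."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")

    results = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # Each call returns one result row per input row (assumption).
        results.extend(predict_fn(batch))
    return results


def fake_predict(batch):
    """Stand-in for an endpoint call: label a row 1 if its values sum above 1."""
    return [["1" if sum(row) > 1 else "0"] for row in batch]


rows = [[0, 1], [1, 1], [0, 0], [1, 1], [1, 0]]
predictions = predict_in_batches(fake_predict, rows, batch_size=2)
print(predictions)
```

In the notebook, the first argument would be the predictor's own method, for example `predict_in_batches(V1_model_1_predictor.predict, testing_dataset_v1)`, assuming each call returns a list with one entry per input row.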

In [18]:
# simple model on combined_csv_v2

V2_region = boto3.Session().region_name    
V2_container = get_image_uri(V2_region, 'linear-learner')
V2_role = sagemaker.get_execution_role() 

V2_model_1 = Estimator(V2_container,
    role= V2_role, 
    train_instance_count=1,
    train_instance_type='ml.m4.4xlarge',
    output_path='s3://{}/{}/output'.format(bucket, V2_prefix)
)


V2_model_1.set_hyperparameters(predictor_type='binary_classifier', mini_batch_size=1000)

training_data_v2_channel   = sagemaker.s3_input(s3_data=training_data_v2_path, content_type='text/csv')
validation_data_v2_channel = sagemaker.s3_input(s3_data=validation_data_v2_path, content_type='text/csv')

V2_ll_data = {'train': training_data_v2_channel, 'validation': validation_data_v2_channel}

# Fit the models
V2_model_1.fit(V2_ll_data)


############################################################################################################################

timestamp = strftime('%d-%H-%M-%S', gmtime())

endpoint_name = 'linear-learner-demo-'+timestamp
print(endpoint_name)

V2_model_1_predictor = V2_model_1.deploy(endpoint_name=endpoint_name, 
                        initial_instance_count=1,
                        #update_endpoint=True,
                        instance_type='ml.m4.xlarge')

V2_model_1_predictor.content_type = 'text/csv'
V2_model_1_predictor.serializer = csv_serializer
V2_model_1_predictor.deserializer = csv_deserializer


v2_response = V2_model_1_predictor.predict(testing_dataset_v2)

print(" ")
print(v2_response)

2021-11-15 13:42:40 Starting - Starting the training job......
2021-11-15 13:43:21 Starting - Launching requested ML instances......
2021-11-15 13:44:29 Starting - Preparing the instances for training.........
2021-11-15 13:46:04 Downloading - Downloading input data...
2021-11-15 13:46:40 Training - Downloading the training image...
2021-11-15 13:47:00 Training - Training image download completed. Training in progress.
Docker entrypoint called with argument(s): train
Running default environment configuration script
[11/15/2021 13:47:06 INFO 139731162273600] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/resources/default-input.json: {'mini_batch_size': '1000', 'epochs': '15', 'feature_dim': 'auto', 'use_bias': 'true', 'binary_classifier_model_selection_criteria': 'accuracy', 'f_beta': '1.0', 'target_recall': '0.8', 'target_precision': '0.8', 'num_models': 'auto', 'num_calibration_samples': '10000000', 'init_method': 'uniform',

In [5]:
# Write the final comments here and turn the cell type into markdown

# Step 3: Build and evaluate ensemble models

Write code to perform the following steps:
1. Split data into training, validation and testing sets (70% - 15% - 15%).
2. Use the XGBoost estimator to build a classification model.
3. Host the model on another instance.
4. Perform batch transform to evaluate the model on testing data.
5. Report the performance metrics that you think best assess the model's performance.
6. Write down your observations on the difference in performance between the simple and ensemble models.

Note: You are required to perform the above steps on the two combined datasets separately.
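
With the `binary:logistic` objective, XGBoost returns a probability for each row rather than a class label, so the scores need a threshold before they can be compared with the true labels. The following is a minimal sketch, assuming the response has been deserialized into rows of numeric strings. The helper name `scores_to_labels`, the 0.5 threshold and the example response are hypothetical.

```python
def scores_to_labels(rows, threshold=0.5):
    """Flatten rows of score strings and threshold them into 0/1 labels."""
    labels = []
    for row in rows:
        for value in row:
            # Scores at or above the threshold are treated as "delayed".
            labels.append(1 if float(value) >= threshold else 0)
    return labels


# Made-up response for illustration only (not output from the endpoint).
example_response = [["0.91", "0.12", "0.50"], ["0.49"]]
print(scores_to_labels(example_response))  # prints: [1, 0, 1, 0]
```

Lowering the threshold trades precision for recall, which may matter if missing a delayed flight is costlier than a false warning.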

In [21]:
# ensemble model (XGBoost) on combined_csv_v1
V1_region = boto3.Session().region_name    
V1_container = get_image_uri(V1_region, 'xgboost', repo_version='1.0-1')
V1_role = sagemaker.get_execution_role() 

V1_xgb = Estimator(V1_container,
    role=V1_role, 
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    output_path='s3://{}/{}/output'.format(bucket, V1_prefix))

V1_xgb.set_hyperparameters(objective='binary:logistic', num_round=200, early_stopping_rounds=10)

training_data_v1_channel   = sagemaker.s3_input(s3_data=training_data_v1_path, content_type='text/csv')
validation_data_v1_channel = sagemaker.s3_input(s3_data=validation_data_v1_path, content_type='text/csv')

xgb_data = {'train': training_data_v1_channel, 'validation': validation_data_v1_channel}

# Fit the models
V1_xgb.fit(xgb_data)

##################################
timestamp = strftime('%d-%H-%M-%S', gmtime())

endpoint_name = 'xgb-demo-'+timestamp
print(endpoint_name)

V1_xgb_predictor = V1_xgb.deploy(endpoint_name=endpoint_name, 
                        initial_instance_count=1, 
                        instance_type='ml.m4.xlarge')


V1_xgb_predictor.content_type = 'text/csv'
V1_xgb_predictor.serializer = csv_serializer
V1_xgb_predictor.deserializer = csv_deserializer

v1_xgb_response = V1_xgb_predictor.predict(testing_dataset_v1)

print(" ")
print(v1_xgb_response)

2021-11-15 14:03:24 Starting - Starting the training job...
2021-11-15 14:03:28 Starting - Launching requested ML instances......
2021-11-15 14:04:38 Starting - Preparing the instances for training.........
2021-11-15 14:06:19 Downloading - Downloading input data...
2021-11-15 14:06:42 Training - Downloading the training image...
2021-11-15 14:07:17 Training - Training image download completed. Training in progress..
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input 

In [28]:
# ensemble model (XGBoost) on combined_csv_v2
V2_region = boto3.Session().region_name    
V2_container = get_image_uri(V2_region, 'xgboost', repo_version='1.0-1')
V2_role = sagemaker.get_execution_role() 

V2_xgb = Estimator(V2_container,
    role=V2_role, 
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    output_path='s3://{}/{}/output'.format(bucket, V2_prefix))

V2_xgb.set_hyperparameters(objective='binary:logistic', num_round=200, early_stopping_rounds=10)

training_data_v2_channel   = sagemaker.s3_input(s3_data=training_data_v2_path, content_type='text/csv')
validation_data_v2_channel = sagemaker.s3_input(s3_data=validation_data_v2_path, content_type='text/csv')

xgb_data = {'train': training_data_v2_channel, 'validation': validation_data_v2_channel}

# Fit the models
V2_xgb.fit(xgb_data)

##################################
timestamp = strftime('%d-%H-%M-%S', gmtime())

endpoint_name = 'xgb-demo-'+timestamp
print(endpoint_name)

V2_xgb_predictor = V2_xgb.deploy(endpoint_name=endpoint_name, 
                        initial_instance_count=1, 
                        instance_type='ml.m4.xlarge')


V2_xgb_predictor.content_type = 'text/csv'
V2_xgb_predictor.serializer = csv_serializer
V2_xgb_predictor.deserializer = csv_deserializer

v2_xgb_response = V2_xgb_predictor.predict(testing_dataset_v2)

print(" ")
print(v2_xgb_response)

2021-11-15 14:23:39 Starting - Starting the training job...
2021-11-15 14:23:48 Starting - Launching requested ML instances......
2021-11-15 14:24:49 Starting - Preparing the instances for training.........
2021-11-15 14:26:35 Downloading - Downloading input data...
2021-11-15 14:27:00 Training - Downloading the training image.....
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
[14:27:43] 26709x85 matrix with 2270265 entries loaded from /opt/ml/inp

In [None]:
# Write the final comments here and turn the cell type into markdown