# Problem: Predicting Airplane Delays

The goals of this notebook are:
- Process and create a dataset from downloaded ZIP files
- Exploratory data analysis (EDA)
- Establish a baseline model and improve it

## Introduction to business scenario
You work for a travel booking website that wants to improve the customer experience for flights that are delayed. The company wants to create a feature that tells customers, when they book a flight to or from one of the busiest US airports for domestic travel, whether the flight is likely to be delayed due to weather.

You are tasked with solving part of this problem by using machine learning to identify whether a flight will be delayed due to weather. You have been given access to a dataset of on-time performance of domestic flights operated by large air carriers. You can use this data to train a machine learning model that predicts whether a flight to or from the busiest airports will be delayed.

### Dataset
The provided dataset contains scheduled and actual departure and arrival times reported by certified US air carriers that account for at least 1 percent of domestic scheduled passenger revenues. The data was collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). The dataset contains the date, time, origin, destination, airline, distance, and delay status of flights between 2014 and 2018.

The data is split across 60 compressed files, one per month for the five years 2014 to 2018, and each file contains a CSV with that month's flight details. The data can be downloaded from this link: [https://ucstaff-my.sharepoint.com/:f:/g/personal/ibrahim_radwan_canberra_edu_au/Er0nVreXmihEmtMz5qC5kVIB81-ugSusExPYdcyQTglfLg?e=bNO312]. Please download the data files and place them in a directory relative to this notebook. The dataset(s) used in this assignment were compiled by the Office of Airline Information, Bureau of Transportation Statistics (BTS), Airline On-Time Performance Data, available at the following link: [https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ].
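
As a rough illustration of how the monthly ZIP files could be merged into one CSV, the sketch below uses only the Python standard library. The function name `combine_zipped_csvs`, the directory layout, and the assumption that every monthly CSV shares the same header are hypothetical; they are not taken from the assignment material.

```python
import csv
import io
import zipfile
from pathlib import Path


def combine_zipped_csvs(zip_dir, output_csv):
    """Concatenate the first CSV found in each ZIP under zip_dir into one file.

    Assumes all CSVs share the same header. Returns the number of data rows
    written (header excluded).
    """
    rows_written = 0
    header = None
    with open(output_csv, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        # Sort so the archives are processed in a stable order
        for zip_path in sorted(Path(zip_dir).glob("*.zip")):
            with zipfile.ZipFile(zip_path) as archive:
                csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
                if not csv_names:
                    continue  # skip archives without a CSV member
                with archive.open(csv_names[0]) as raw:
                    text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                    reader = csv.reader(text)
                    file_header = next(reader, None)
                    if file_header is None:
                        continue  # empty CSV
                    if header is None:
                        header = file_header
                        writer.writerow(header)  # write the header only once
                    for row in reader:
                        writer.writerow(row)
                        rows_written += 1
    return rows_written
```

Streaming row by row keeps memory use low, which may matter when the combined file is large; loading each month with pandas and concatenating would be an equally valid choice on a machine with enough memory.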

# Step 1: Prepare the environment 

Use one of the labs that we have practised on with Amazon SageMaker, and perform the following steps:
1. Start a lab.
2. Create a notebook instance and name it "oncloudproject".
3. Increase the volume size (storage) to 25 GB under the additional configuration options.
4. Open Jupyter Lab and upload this notebook into it.
5. Upload the two combined CSV files (combined_csv_v1.csv and combined_csv_v2.csv), which you created in Part A of this project.

# Step 2: Build and evaluate simple models

Write code to perform the following steps:
1. Split data into training, validation and testing sets (70% - 15% - 15%).
2. Use the linear learner estimator to build a classification model.
3. Host the model on another instance.
4. Perform batch transform to evaluate the model on testing data.
5. Report the performance metrics that you think best test the model performance.

Note: You are required to perform the above steps on the two combined datasets separately and to comment on the differences.

## CSV 1

In [47]:
# Write the code here and add cells as you need
import numpy as np
import pandas as pd
import zipfile

In [48]:
import zipfile
import pandas as pd
import os

# The uploaded zip file name
zip_file_name = 'comb.zip'

# The expected csv file name inside the zip
# You may need to adjust this if the CSV file has a different name
csv_file_name = 'combined_csv_v1.csv'

# The directory to extract the files to (it will be created if it doesn't exist)
extract_dir = 'extracted_files'

# Ensure the directory exists
if not os.path.exists(extract_dir):
    os.makedirs(extract_dir)

# Unzip the file and extract its contents
with zipfile.ZipFile(zip_file_name, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)

# Construct the path to the CSV file
csv_file_path = os.path.join(extract_dir, csv_file_name)

# Read the first 1,000,000 rows of the CSV file using pandas
data = pd.read_csv(csv_file_path, nrows=1000000)
data.head()

In [50]:
# Split the data into 70% train, 15% validation, and 15% test
train_frac = 0.70
val_frac = 0.15 / (1 - train_frac)  # Calculate fraction of remaining data

# Splitting the datasets
training_dataset = data.sample(frac=train_frac, random_state=59)
remaining_dataset = data.loc[~data.index.isin(training_dataset.index), :]

validation_dataset = remaining_dataset.sample(frac=val_frac, random_state=59)
testing_dataset = remaining_dataset.loc[~remaining_dataset.index.isin(validation_dataset.index), :]


In [51]:
print(data.shape)
print(training_dataset.shape)
print(validation_dataset.shape)
print(testing_dataset.shape)

(1000000, 72)
(700000, 72)
(150000, 72)
(150000, 72)
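
The two-stage split above relies on a small piece of arithmetic: the validation set is sampled from the 30% that remains after training rows are removed, so 15% of the total corresponds to 0.15 / 0.30 = 0.5 of the remainder. The sketch below reproduces only the row counts; `split_sizes` is a hypothetical helper, and it rounds the counts the way a fraction-based sample would be expected to.

```python
def split_sizes(n_rows, train_frac=0.70, val_share_of_total=0.15):
    """Return (train, validation, test) row counts for a two-stage split."""
    n_train = round(n_rows * train_frac)
    n_remaining = n_rows - n_train
    # Validation is drawn from the remainder, so its fraction is rescaled:
    # 0.15 of the total is 0.15 / 0.30 = 0.5 of the remaining rows
    val_frac_of_remaining = val_share_of_total / (1 - train_frac)
    n_val = round(n_remaining * val_frac_of_remaining)
    n_test = n_remaining - n_val
    return n_train, n_val, n_test


print(split_sizes(1_000_000))  # (700000, 150000, 150000)
```

These counts agree with the shapes printed by the cell above.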


In [52]:
# # Writing files to a file for S3 later
# training_dataset.to_csv('training_dataset.csv', index=False, header=False)
# testing_dataset.to_csv('testing_dataset.csv',index=False, header=False)
# validation_dataset.to_csv('validation_dataset.csv', index=False, header=False)

In [53]:
import sagemaker
import boto3
from sagemaker import TrainingInput
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.estimator import Estimator
import pandas as pd
from sklearn.model_selection import train_test_split
import io
# Initialising SageMaker
# SageMaker session and default S3 bucket
session = sagemaker.Session()
default_bucket = session.default_bucket()
data_prefix = 'model-data'

# Define AWS region and container for Linear Learner
aws_region = boto3.Session().region_name
linear_container = get_image_uri(aws_region, 'linear-learner')


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.


In [54]:
# Pre-processing and uploading to S3
# Saving each split as a headerless CSV and uploading it
def process_and_upload_data(dataframe, file_name, s3_prefix):
    dataframe.to_csv(file_name, header=False, index=False)
    return session.upload_data(file_name, key_prefix=s3_prefix)

train_path = process_and_upload_data(training_dataset, 'training_data.csv', f"{data_prefix}/train")
valid_path = process_and_upload_data(validation_dataset, 'validation_data.csv', f"{data_prefix}/valid")
test_path = process_and_upload_data(testing_dataset, 'testing_data.csv', f"{data_prefix}/test")


In [55]:
# Model Configuration and training
sm_role = sagemaker.get_execution_role()
linear_config = {
    'image_uri': linear_container,
    'role': sm_role,
    'instance_count': 1,
    'instance_type': 'ml.m5.large',
    'output_path': f's3://{default_bucket}/{data_prefix}/output'
}

linear_model = Estimator(**linear_config)
linear_model.set_hyperparameters(predictor_type='binary_classifier', mini_batch_size=1000, epochs=3)

# Data channels
channels = {
    'train': TrainingInput(train_path, content_type='text/csv'),
    'validation': TrainingInput(valid_path, content_type='text/csv')
}

linear_model.fit(channels)


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


INFO:sagemaker:Creating training-job with name: linear-learner-2023-10-31-20-26-34-831


2023-10-31 20:26:35 Starting - Starting the training job...
2023-10-31 20:26:50 Starting - Preparing the instances for training......
2023-10-31 20:27:59 Downloading - Downloading input data......
2023-10-31 20:28:49 Training - Downloading the training image......
2023-10-31 20:29:55 Training - Training image download completed. Training in progress..
Docker entrypoint called with argument(s): train
Running default environment configuration script
[10/31/2023 20:30:00 INFO 139963066464064] Reading default configuration from /opt/amazon/lib/python3.8/site-packages/algorithm/resources/default-input.json: {'mini_batch_size': '1000', 'epochs': '15', 'feature_dim': 'auto', 'use_bias': 'true', 'binary_classifier_model_selection_criteria': 'accuracy', 'f_beta': '1.0', 'target_recall': '0.8', 'target_precision': '0.8', 'num_models': 'auto', 'num_calibration_samples': '10000000', 'init_method': 'uniform', 'init_scale': '0.07', 'init_sigma': '0.01', 'init_bias': '0.0', 'opt

In [57]:
# Prepare test data for batch transform
batch_input = testing_dataset.drop(columns=['target'])  
batch_input.to_csv('batch_input.csv', index=False, header=False)
batch_input_path = session.upload_data('batch_input.csv', key_prefix=f"{data_prefix}/batch_input")

# Batch transform settings and execution
# (the chosen instance type must have a non-zero transform-job quota on this account)
output_location = f's3://{default_bucket}/{data_prefix}/batch_output'
transformer = linear_model.transformer(instance_count=1, instance_type='ml.m5.xlarge', output_path=output_location)
transformer.transform(batch_input_path, content_type='text/csv')
transformer.wait()


INFO:sagemaker:Creating model with name: linear-learner-2023-10-31-20-33-16-713
INFO:sagemaker:Creating transform job with name: linear-learner-2023-10-31-20-33-17-303


ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateTransformJob operation: The account-level service limit 'ml.m5.xlarge for transform job usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please use AWS Service Quotas to request an increase for this quota. If AWS Service Quotas is not available, contact AWS support to request an increase for this quota.

In [None]:
# Fetch results from S3
s3_client = boto3.client('s3')
result_file = s3_client.get_object(Bucket=default_bucket, Key=f"{data_prefix}/batch_output/batch_input.csv.out")
predicted_data = pd.read_csv(io.BytesIO(result_file['Body'].read()), header=None, names=['Predicted'])
print(predicted_data.head())


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import confusion_matrix
test_labels = testing_dataset.loc[:, 'target'].values

# Function for plotting the confusion matrix
def plot_confusion_matrix(test_labels, predicted_data, title='Confusion Matrix for 1st csv'):
    # Rows are actual classes, columns are predicted classes
    conf_matrix = confusion_matrix(test_labels, predicted_data)
    plt.figure(figsize=(6, 6))
    sns.heatmap(conf_matrix, annot=True, fmt='g', cmap='Oranges', 
                xticklabels=['No Delay', 'Delay'], 
                yticklabels=['No Delay', 'Delay'])
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title(title)
    plt.show()

plot_confusion_matrix(test_labels, predicted_data)

In [None]:
from sklearn.metrics import classification_report
print(classification_report(test_labels, predicted_data))
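
To make the numbers in a classification report easier to interpret, the following sketch computes accuracy, precision, recall and F1 by hand for 0/1 labels. It is a plain-Python illustration rather than the scikit-learn implementation; `binary_classification_metrics` and the toy label lists are hypothetical and are not results from this notebook.

```python
def binary_classification_metrics(true_labels, predicted_labels):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # delays caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # delays missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correct "no delay"

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy example: a model that always predicts "no delay"
toy_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
toy_pred = [0] * 10
print(binary_classification_metrics(toy_true, toy_pred))
```

In the toy example the accuracy is 0.8 even though no delay is ever detected (recall is 0). If delayed flights turn out to be a minority class in the real data, precision, recall and F1 for the delay class are likely to be more informative than accuracy alone.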


## CSV 2

In [None]:
# Load into `data`, the variable name the cells below split and report on
data = pd.read_csv('combined_csv_v2.csv', nrows=1000000)
data.head()

In [None]:
# Split the data into 70% train, 15% validation, and 15% test
train_frac = 0.70
val_frac = 0.15 / (1 - train_frac)  # Calculate fraction of remaining data

# Splitting the datasets
training_dataset = data.sample(frac=train_frac, random_state=59)
remaining_dataset = data.loc[~data.index.isin(training_dataset.index), :]

validation_dataset = remaining_dataset.sample(frac=val_frac, random_state=59)
testing_dataset = remaining_dataset.loc[~remaining_dataset.index.isin(validation_dataset.index), :]

In [None]:
print(data.shape)
print(training_dataset.shape)
print(validation_dataset.shape)
print(testing_dataset.shape)

In [None]:
# Initialising SageMaker
# SageMaker session and default S3 bucket
session = sagemaker.Session()
default_bucket = session.default_bucket()
data_prefix = 'model-data'

# Define AWS region and container for Linear Learner
aws_region = boto3.Session().region_name
linear_container = get_image_uri(aws_region, 'linear-learner')


In [None]:
# Pre-processing and uploading to S3
# Saving each split as a headerless CSV and uploading it
def process_and_upload_data(dataframe, file_name, s3_prefix):
    dataframe.to_csv(file_name, header=False, index=False)
    return session.upload_data(file_name, key_prefix=s3_prefix)

train_path = process_and_upload_data(training_dataset, 'training_data.csv', f"{data_prefix}/train")
valid_path = process_and_upload_data(validation_dataset, 'validation_data.csv', f"{data_prefix}/valid")
test_path = process_and_upload_data(testing_dataset, 'testing_data.csv', f"{data_prefix}/test")


In [None]:
# Model Configuration and training
sm_role = sagemaker.get_execution_role()
linear_config = {
    'image_uri': linear_container,
    'role': sm_role,
    'instance_count': 1,
    'instance_type': 'ml.m5.large',
    'output_path': f's3://{default_bucket}/{data_prefix}/output'
}

linear_model = Estimator(**linear_config)
linear_model.set_hyperparameters(predictor_type='binary_classifier', mini_batch_size=1000, epochs=3)

# Data channels
channels = {
    'train': TrainingInput(train_path, content_type='text/csv'),
    'validation': TrainingInput(valid_path, content_type='text/csv')
}

linear_model.fit(channels)


In [None]:
# Prepare test data for batch transform
batch_input = testing_dataset.drop(columns=['target'])  
batch_input.to_csv('batch_input.csv', index=False, header=False)
batch_input_path = session.upload_data('batch_input.csv', key_prefix=f"{data_prefix}/batch_input")

# Batch transform settings and execution
# (the chosen instance type must have a non-zero transform-job quota on this account)
output_location = f's3://{default_bucket}/{data_prefix}/batch_output'
transformer = linear_model.transformer(instance_count=1, instance_type='ml.m5.xlarge', output_path=output_location)
transformer.transform(batch_input_path, content_type='text/csv')
transformer.wait()


In [None]:
# Fetch results from S3
s3_client = boto3.client('s3')
result_file = s3_client.get_object(Bucket=default_bucket, Key=f"{data_prefix}/batch_output/batch_input.csv.out")
predicted_data = pd.read_csv(io.BytesIO(result_file['Body'].read()), header=None, names=['Predicted'])
print(predicted_data.head())


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import confusion_matrix
test_labels = testing_dataset.loc[:, 'target'].values

# Function for plotting the confusion matrix
def plot_confusion_matrix(test_labels, predicted_data, title='Confusion Matrix for 2nd csv'):
    # Rows are actual classes, columns are predicted classes
    conf_matrix = confusion_matrix(test_labels, predicted_data)
    plt.figure(figsize=(6, 6))
    sns.heatmap(conf_matrix, annot=True, fmt='g', cmap='Oranges', 
                xticklabels=['No Delay', 'Delay'], 
                yticklabels=['No Delay', 'Delay'])
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title(title)
    plt.show()

plot_confusion_matrix(test_labels, predicted_data)

In [None]:
from sklearn.metrics import classification_report
print(classification_report(test_labels, predicted_data))

# Step 3: Build and evaluate ensemble models

Write code to perform the following steps:
1. Split data into training, validation and testing sets (70% - 15% - 15%).
2. Use the xgboost estimator to build a classification model.
3. Host the model on another instance.
4. Perform batch transform to evaluate the model on testing data.
5. Report the performance metrics that you think best test the model performance.
6. Write down your observations on the difference in performance between the simple and ensemble models.

Note: You are required to perform the above steps on the two combined datasets separately.

## CSV 1 - XGB


In [None]:
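# Load both combined datasets; the cells below split and train on each of them
dataset1 = pd.read_csv('combined_csv_v1.csv', nrows=1000000)
dataset2 = pd.read_csv('combined_csv_v2.csv', nrows=1000000)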

In [None]:
import sagemaker
import boto3
from sagemaker.inputs import TrainingInput
from sagemaker.xgboost.estimator import XGBoost
import pandas as pd
from sklearn.model_selection import train_test_split
import io

# SageMaker session and default S3 bucket
session = sagemaker.Session()
default_bucket = session.default_bucket()
data_prefix = 'ensemble-model-data'

# Define AWS region and XGBoost container
aws_region = boto3.Session().region_name
xgboost_container = sagemaker.image_uris.retrieve("xgboost", aws_region, "latest")


In [None]:
# Data preprocessing 
def split_and_upload_dataset(dataset, prefix):
    # Split dataset
    train_df, temp_df = train_test_split(dataset, test_size=0.30, random_state=0)
    valid_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=0)
    
    # Save and upload to S3
    train_path = process_and_upload_data(train_df, 'train.csv', f"{prefix}/train")
    valid_path = process_and_upload_data(valid_df, 'valid.csv', f"{prefix}/valid")
    test_path = process_and_upload_data(test_df, 'test.csv', f"{prefix}/test")
    
    return train_path, valid_path, test_path

def process_and_upload_data(dataframe, file_name, s3_prefix):
    dataframe.to_csv(file_name, header=False, index=False)
    return session.upload_data(file_name, key_prefix=s3_prefix)

# Assuming you've loaded your two combined datasets into variables named 'dataset1' and 'dataset2'
train1_path, valid1_path, test1_path = split_and_upload_dataset(dataset1, f"{data_prefix}/dataset1")
train2_path, valid2_path, test2_path = split_and_upload_dataset(dataset2, f"{data_prefix}/dataset2")


In [None]:
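def train_and_transform(data_paths, data_prefix):
    # Configure the built-in XGBoost algorithm through the generic Estimator,
    # so no custom entry-point script is required
    xgb = sagemaker.estimator.Estimator(
        image_uri=xgboost_container,
        role=sagemaker.get_execution_role(),
        instance_count=1,
        instance_type='ml.m5.xlarge',
        output_path=f's3://{default_bucket}/{data_prefix}/output'
    )
    # Illustrative starting values; tune as needed
    xgb.set_hyperparameters(objective='binary:logistic', num_round=50)
    
    # Train model
    data_channels = {
        'train': TrainingInput(data_paths[0], content_type='text/csv'),
        'validation': TrainingInput(data_paths[1], content_type='text/csv')
    }
    xgb.fit(data_channels)
    
    # Batch transform; predictions are written under <data_prefix>/batch-out/
    transformer = xgb.transformer(
        instance_count=1,
        instance_type='ml.m5.large',
        output_path=f's3://{default_bucket}/{data_prefix}/batch-out'
    )
    # input_filter drops the first column, assumed here to be the label
    transformer.transform(data_paths[2], content_type='text/csv', split_type='Line',
                          input_filter='$[1:]')
    transformer.wait()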

# Train and transform for both datasets
train_and_transform([train1_path, valid1_path, test1_path], f"{data_prefix}/dataset1")
train_and_transform([train2_path, valid2_path, test2_path], f"{data_prefix}/dataset2")


In [None]:
# Function to retrieve batch transform results from S3
def get_batch_transform_results(bucket, batch_output_path):
    s3_client = boto3.client('s3')
    
    # Get the file from the S3 bucket; batch transform names its output
    # after the input file (test.csv) with an '.out' suffix
    s3_response_object = s3_client.get_object(Bucket=bucket, Key=batch_output_path + 'test.csv.out')
    object_content = s3_response_object['Body'].read().decode('utf-8')
    
    # Convert the string content to a DataFrame
    results_df = pd.read_csv(io.StringIO(object_content), header=None)
    
    return results_df

# Get results for the XGBoost batch transform
xgb_results = get_batch_transform_results(default_bucket, f"{data_prefix}/dataset1/batch-out/")

# Display the first few rows of the results
print(xgb_results.head())


In [None]:
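# Recreate the test split from split_and_upload_dataset (same random_state)
_, temp_df = train_test_split(dataset1, test_size=0.30, random_state=0)
_, test_df = train_test_split(temp_df, test_size=0.5, random_state=0)
true_labels = test_df.iloc[:, 0].values  # assumes the label is the first column

# Assumption: the model returns one probability per row; threshold at 0.5
predicted_labels = (xgb_results.iloc[:, 0] > 0.5).astype(int)

# Generate the classification report
report = classification_report(true_labels, predicted_labels)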

print(report)

## CSV 2 - XGB

In [None]:
dataset2 = pd.read_csv('combined_csv_v2.csv', nrows=1000000)

In [None]:
import sagemaker
import boto3
from sagemaker.inputs import TrainingInput
from sagemaker.xgboost.estimator import XGBoost
import pandas as pd
from sklearn.model_selection import train_test_split
import io

# SageMaker session and default S3 bucket
session = sagemaker.Session()
default_bucket = session.default_bucket()
data_prefix = 'ensemble-model-data'

# Define AWS region and XGBoost container
aws_region = boto3.Session().region_name
xgboost_container = sagemaker.image_uris.retrieve("xgboost", aws_region, "latest")


In [None]:
# Data preprocessing 
def split_and_upload_dataset(dataset, prefix):
    # Split dataset
    train_df, temp_df = train_test_split(dataset, test_size=0.30, random_state=0)
    valid_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=0)
    
    # Save and upload to S3
    train_path = process_and_upload_data(train_df, 'train.csv', f"{prefix}/train")
    valid_path = process_and_upload_data(valid_df, 'valid.csv', f"{prefix}/valid")
    test_path = process_and_upload_data(test_df, 'test.csv', f"{prefix}/test")
    
    return train_path, valid_path, test_path

def process_and_upload_data(dataframe, file_name, s3_prefix):
    dataframe.to_csv(file_name, header=False, index=False)
    return session.upload_data(file_name, key_prefix=s3_prefix)

# Assuming you've loaded your two combined datasets into variables named 'dataset1' and 'dataset2'
train1_path, valid1_path, test1_path = split_and_upload_dataset(dataset1, f"{data_prefix}/dataset1")
train2_path, valid2_path, test2_path = split_and_upload_dataset(dataset2, f"{data_prefix}/dataset2")


In [None]:
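def train_and_transform(data_paths, data_prefix):
    # Configure the built-in XGBoost algorithm through the generic Estimator,
    # so no custom entry-point script is required
    xgb = sagemaker.estimator.Estimator(
        image_uri=xgboost_container,
        role=sagemaker.get_execution_role(),
        instance_count=1,
        instance_type='ml.m5.xlarge',
        output_path=f's3://{default_bucket}/{data_prefix}/output'
    )
    # Illustrative starting values; tune as needed
    xgb.set_hyperparameters(objective='binary:logistic', num_round=50)
    
    # Train model
    data_channels = {
        'train': TrainingInput(data_paths[0], content_type='text/csv'),
        'validation': TrainingInput(data_paths[1], content_type='text/csv')
    }
    xgb.fit(data_channels)
    
    # Batch transform; predictions are written under <data_prefix>/batch-out/
    transformer = xgb.transformer(
        instance_count=1,
        instance_type='ml.m5.large',
        output_path=f's3://{default_bucket}/{data_prefix}/batch-out'
    )
    # input_filter drops the first column, assumed here to be the label
    transformer.transform(data_paths[2], content_type='text/csv', split_type='Line',
                          input_filter='$[1:]')
    transformer.wait()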

# Train and transform for both datasets
train_and_transform([train1_path, valid1_path, test1_path], f"{data_prefix}/dataset1")
train_and_transform([train2_path, valid2_path, test2_path], f"{data_prefix}/dataset2")


In [None]:
# Function to retrieve batch transform results from S3
def get_batch_transform_results(bucket, batch_output_path):
    s3_client = boto3.client('s3')
    
    # Get the file from the S3 bucket; batch transform names its output
    # after the input file (test.csv) with an '.out' suffix
    s3_response_object = s3_client.get_object(Bucket=bucket, Key=batch_output_path + 'test.csv.out')
    object_content = s3_response_object['Body'].read().decode('utf-8')
    
    # Convert the string content to a DataFrame
    results_df = pd.read_csv(io.StringIO(object_content), header=None)
    
    return results_df

# Get results for the XGBoost batch transform on the second dataset
xgb_results = get_batch_transform_results(default_bucket, f"{data_prefix}/dataset2/batch-out/")

# Display the first few rows of the results
print(xgb_results.head())


In [None]:
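# Recreate the test split from split_and_upload_dataset (same random_state)
_, temp_df = train_test_split(dataset2, test_size=0.30, random_state=0)
_, test_df = train_test_split(temp_df, test_size=0.5, random_state=0)
true_labels = test_df.iloc[:, 0].values  # assumes the label is the first column

# Assumption: the model returns one probability per row; threshold at 0.5
predicted_labels = (xgb_results.iloc[:, 0] > 0.5).astype(int)

# Generate the classification report
report = classification_report(true_labels, predicted_labels)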

print(report)

In [None]:
# Write the final comments here and turn the cell type into markdown