# Problem: Predicting Airplane Delays

The goals of this notebook are:
- Process and create a dataset from downloaded ZIP files
- Perform exploratory data analysis (EDA)
- Establish a baseline model and improve it

## Introduction to business scenario
You work for a travel booking website that is working to improve the customer experience for flights that were delayed. The company wants to create a feature to let customers know if the flight will be delayed due to weather when the customers are booking the flight to or from the busiest airports for domestic travel in the US. 

You are tasked with solving part of this problem by using machine learning to identify whether a flight will be delayed due to weather. You have been given access to a dataset of on-time performance of domestic flights operated by large air carriers. You can use this data to train a machine learning model that predicts whether a flight to or from the busiest airports will be delayed.

### Dataset
The provided dataset contains scheduled and actual departure and arrival times reported by certified US air carriers that account for at least 1 percent of domestic scheduled passenger revenues. The data was collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). The dataset contains the date, time, origin, destination, airline, distance, and delay status of flights between 2014 and 2018.
The data are in 60 compressed files, each containing a CSV with the flight details for one month of the five years (2014 to 2018). The data can be downloaded from this link: [https://ucstaff-my.sharepoint.com/:f:/g/personal/ibrahim_radwan_canberra_edu_au/Er0nVreXmihEmtMz5qC5kVIB81-ugSusExPYdcyQTglfLg?e=bNO312]. Please download the data files and place them at a path relative to this notebook. The dataset(s) used in this assignment were compiled by the Office of Airline Information, Bureau of Transportation Statistics (BTS), Airline On-Time Performance Data, available at the following link: [https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ].
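
As a rough illustration of how the monthly archives described above could be read, the sketch below walks a folder of ZIP files and yields the rows of every CSV it finds. The file names, the column names, and the UTF-8 encoding are assumptions made for the demo, not details confirmed by the dataset, and the function name `iter_rows_from_zips` is hypothetical.

```python
import csv
import io
import tempfile
import zipfile
from pathlib import Path


def iter_rows_from_zips(zip_dir):
    """Yield each CSV row (as a dict) from every ZIP archive in zip_dir."""
    for zip_path in sorted(Path(zip_dir).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            for member in archive.namelist():
                if not member.lower().endswith(".csv"):
                    continue  # skip readme files or other non-CSV members
                with archive.open(member) as raw:
                    # Assumption: the CSV files are UTF-8 encoded
                    text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                    yield from csv.DictReader(text)


# Demo on two tiny, made-up archives (hypothetical names and columns)
with tempfile.TemporaryDirectory() as tmp:
    for month in (1, 2):
        archive_path = Path(tmp) / f"flights_2014_{month}.zip"
        with zipfile.ZipFile(archive_path, "w") as zf:
            zf.writestr(f"flights_2014_{month}.csv",
                        f"Month,Origin,Dest\n{month},ATL,ORD\n")
    demo_rows = list(iter_rows_from_zips(tmp))

print(demo_rows)
```

Reading the rows lazily keeps memory use low; for the real 60 files it may be more practical to load each CSV with pandas and concatenate the results.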

# Step 1: Prepare the environment 

Use one of the Amazon SageMaker labs that we have practised on and perform the following steps:
1. Start a lab.
2. Create a notebook instance and name it "oncloudproject".
3. Increase the used memory to 25 GB from the additional configurations.
4. Open Jupyter Lab and upload this notebook into it.
5. Upload the two combined CSV files (combined_csv_v1.csv and combined_csv_v2.csv), which you created in Part A of this project.
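
Before running the cells below, it can help to confirm that both uploads from step 5 actually landed next to the notebook. This is a minimal sketch; the helper name `missing_files` is hypothetical and it assumes the files sit in the notebook's working directory.

```python
from pathlib import Path

EXPECTED_FILES = ("combined_csv_v1.csv", "combined_csv_v2.csv")


def missing_files(directory=".", expected=EXPECTED_FILES):
    """Return the expected file names that are not present in directory."""
    base = Path(directory)
    return [name for name in expected if not (base / name).is_file()]


# An empty list means both combined CSV files are in place
print(missing_files())
```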

## Reading data

This notebook focuses on the first dataset (`combined_csv_v2.csv`), which is used to train a linear model and an XGBoost model. The results are compared directly.

Because of the 2-hour limit of the AWS lab environments, this notebook is duplicated to run dataset 2 (`combined_csv_v1.csv`), which is not covered in this notebook.

In [7]:
import pandas as pd
df_1 = pd.read_csv("combined_csv_v1.csv", index_col=0)
df_2 = pd.read_csv("combined_csv_v2.csv", index_col=0)

In [11]:
df_1.drop_duplicates(inplace=True)
df_1

Unnamed: 0,target,Distance,Quarter_1,Quarter_2,Quarter_3,Quarter_4,Month_1,Month_2,Month_3,Month_4,...,DepHourofDay_14,DepHourofDay_15,DepHourofDay_16,DepHourofDay_17,DepHourofDay_18,DepHourofDay_19,DepHourofDay_20,DepHourofDay_21,DepHourofDay_22,DepHourofDay_23
0,1.0,1587.0,1,0,0,0,0,0,1,0,...,0,0,0,1,0,0,0,0,0,0
1,0.0,1587.0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,0.0,602.0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,1.0,602.0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
4,0.0,602.0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1658125,1.0,606.0,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
1658126,1.0,1199.0,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1658127,1.0,1199.0,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1658128,0.0,1947.0,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [12]:
df_2.drop_duplicates(inplace=True)
df_2

Unnamed: 0,target,Distance,AWND_O,PRCP_O,TAVG_O,AWND_D,PRCP_D,TAVG_D,SNOW_O,SNOW_D,...,DepHourofDay_14,DepHourofDay_15,DepHourofDay_16,DepHourofDay_17,DepHourofDay_18,DepHourofDay_19,DepHourofDay_20,DepHourofDay_21,DepHourofDay_22,DepHourofDay_23
0,1.0,1587.0,20,0,206.0,38,0,134.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1587.0,20,0,206.0,38,0,134.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,602.0,20,0,206.0,51,0,79.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,602.0,20,0,206.0,51,0,79.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.0,602.0,20,0,206.0,51,0,79.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
205727,0.0,226.0,13,0,190.0,19,0,164.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
205728,0.0,226.0,19,0,164.0,13,0,190.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
205729,0.0,226.0,13,0,190.0,19,0,164.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
205730,0.0,689.0,18,0,232.0,13,0,190.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Define a helper function to train, deploy, and test on AWS

In [13]:
# Import the required libraries
import sagemaker
from sagemaker import get_execution_role
from sagemaker.xgboost import XGBoost
import boto3
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [14]:
# Set up SageMaker session and role and bucket 
sagemaker_session = sagemaker.Session()
role = get_execution_role()

# Specify the S3 bucket used for data storage:
# take the first bucket that is visible to this role
s3_bucket = [bucket['Name'] for bucket in boto3.client('s3').list_buckets()['Buckets']][0]


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml
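
Taking the first bucket returned by `list_buckets()` works when the lab account has a single bucket, but the order is not something to rely on if more buckets exist. A slightly more defensive sketch is shown below; the helper `pick_bucket` is hypothetical, and the `"labbucket"` hint is an assumption based on how the lab bucket happens to be named in this environment.

```python
def pick_bucket(bucket_names, hint="labbucket"):
    """Return the first bucket whose name contains hint, else the first bucket."""
    if not bucket_names:
        raise ValueError("No S3 buckets are visible to this role")
    for name in bucket_names:
        if hint in name:
            return name
    return bucket_names[0]  # fall back to the original behaviour


# In the notebook, the names would come from boto3, for example:
# names = [b['Name'] for b in boto3.client('s3').list_buckets()['Buckets']]
names = ["some-other-bucket", "c0000-labbucket-example"]  # made-up names
print(pick_bucket(names))
```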


The `train_deploy_test` function captures the entire pipeline: train, validate, deploy, test (batch transform), and report the model performance.

Because the same pipeline has to run on 2 datasets with 2 training algorithms, it is easier to use this function for all 4 scenarios with different parameters.
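
The four scenarios can be enumerated up front, which makes it easy to check that every run gets its own S3 prefix. The sketch below only builds the parameter combinations; the dataset labels are placeholders for `df_1` and `df_2`, and the prefix pattern mirrors the one used in the cells further down.

```python
from itertools import product

dataset_labels = ("dataset1", "dataset2")        # stand-ins for df_1 and df_2
estimator_names = ("linear-learner", "xgboost")

scenarios = [
    {
        "dataset": dataset,
        "estimator_name": estimator,
        "prefix": f"flight-delay-prediction-{estimator}-{dataset}",
    }
    for estimator, dataset in product(estimator_names, dataset_labels)
]

for scenario in scenarios:
    print(scenario["estimator_name"], "->", scenario["prefix"])
```

Each training run takes a while in the lab environment, so in practice the scenarios are still launched one cell at a time rather than in a single loop.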

Training a model requires 4 parameters:


In [15]:
def train_deploy_test(data, estimator_name, bucket, prefix):
    """
    Run the train, validate, deploy, and test pipeline for one dataset and one estimator.

    Params:
        data: a DataFrame whose 'target' column is the label and whose remaining columns are features
        estimator_name: 'xgboost' or 'linear-learner', the name of a built-in AWS algorithm
        bucket: an S3 bucket that can be used to upload data
        prefix: an S3 folder for the uploaded data; because the same function is used for 4 different runs,
                the prefix should be different for each run to avoid data conflicts
    """
    
    print(f"Start train_deploy_test for {estimator_name} and store at s3://{bucket}/{prefix}")

    # Handle missing values if necessary
    data = data.dropna()

    # Split the dataset into features (X) and target (y)
    X = data.drop('target', axis=1)  # Features
    y = (data['target']).astype(int)  # Binary classification: 1 for delay, 0 for no delay

    # Encode categorical features if needed (e.g., using one-hot encoding)
    X = pd.get_dummies(X)

    # Split the data into training, validation, and testing sets (70% - 15% - 15%)
    X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

    # Put the target in the first column, as the built-in algorithms expect for CSV input
    train_data = pd.concat([y_train, X_train], axis=1)
    val_data = pd.concat([y_val, X_val], axis=1)
    test_data = pd.concat([y_test, X_test], axis=1)

    # Save train and validation data to support for the training process
    train_data.to_csv('train_data.csv', header=False, index=False)
    val_data.to_csv('val_data.csv', header=False, index=False)
    
    # Save the test features for batch transform.
    # The model predicts from features only, so the target column
    # is dropped from the test data before it is written out.
    test_data.drop(columns=["target"]).to_csv('test_data.csv', header=False, index=False)

    # Upload data to s3 ready for training and for later testing
    s3_train_data = sagemaker_session.upload_data('train_data.csv', bucket=bucket, key_prefix=prefix)
    s3_val_data = sagemaker_session.upload_data('val_data.csv', bucket=bucket, key_prefix=prefix)
    s3_test_data = sagemaker_session.upload_data('test_data.csv', bucket=bucket, key_prefix=prefix)

    # Create training input channels for the training and validation data (not for the test data)
    train_recordset = sagemaker.inputs.TrainingInput(s3_data=s3_train_data, content_type='text/csv')
    val_recordset = sagemaker.inputs.TrainingInput(s3_data=s3_val_data, content_type='text/csv')

    # Configure the estimator for a classification model
    estimator = sagemaker.estimator.Estimator(
        role=role,
        image_uri=sagemaker.image_uris.retrieve(estimator_name, boto3.Session().region_name, version='latest'),
        instance_count=1,
        instance_type='ml.m4.xlarge',
        sagemaker_session=sagemaker_session,
        disable_profiler=True
    )
    # Identify params suitable for a provided estimator name
    print(f"Setting params for {estimator_name}")
    if estimator_name == 'xgboost':
        estimator.set_hyperparameters(
            objective="binary:logistic",
            num_round=100,
            max_depth=5,
            eta=0.2,
            alpha=0.1
        )
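    elif estimator_name == 'linear-learner':
        # Hyperparameters for the linear learner binary classifier
        estimator.set_hyperparameters(
            predictor_type="binary_classifier",
            mini_batch_size=100,
            epochs=10
        )
    else:
        # Fail early instead of training with unset hyperparameters
        raise ValueError(f"Unsupported estimator_name: {estimator_name}")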

    # Start training the model with training set and validate using validation set
    print(f'Start fitting : {estimator_name}')
    estimator.fit({'train': train_recordset, 'validation': val_recordset})

    # Deploy the trained model on another SageMaker instance
    # The trained model is ready for both endpoint execution and batch processing
    predictor = estimator.deploy(
        initial_instance_count=1,
        instance_type='ml.m4.xlarge',
        endpoint_name=estimator.latest_training_job.name,
        model_name=estimator.latest_training_job.name
    )
    # Create a batch transform job for testing data
    print('Create a batch transform job for testing data')
    transformer = estimator.transformer(
        instance_count=1,
        instance_type='ml.m4.xlarge',
        accept='text/csv'
    )
    
    # The test file has no target column, so its columns match the feature matrix X
    transformer.transform(data=s3_test_data, content_type='text/csv', split_type='Line')
    transformer.wait()

    # Download the results from S3
    print('Download the results from S3')
    s3_output_path = transformer.output_path
    output_files = sagemaker.s3.S3Downloader.list(s3_output_path)

    results = np.array(
        [sagemaker.s3.S3Downloader.read_file(file).split() for file in output_files]
    ).astype(float)

    # Calculate the number of records in the transformed data
    num_records = len(results[0])
    print(f'Obtained {num_records} predictions from the transformed data')

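    # Evaluate the model performance
    print("Evaluate the model performance")
    # Threshold at 0.5: binary:logistic returns probabilities, which int() alone would truncate to 0
    y_pred = [int(result > 0.5) for result in results[0]]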
    accuracy = accuracy_score(y_test, y_pred)
    confusion = confusion_matrix(y_test, y_pred)
    classification_report_str = classification_report(y_test, y_pred, target_names=['No Delay', 'Delay'])

    print(f'Number of Records: {num_records}')
    print(f'Accuracy: {accuracy:.4f}')
    print('Confusion Matrix:')
    print(confusion)
    print('Classification Report:')
    print(classification_report_str)
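
One detail of the evaluation step deserves a closer look. With `objective="binary:logistic"`, XGBoost produces a probability for each record rather than a class label, so the scores have to be cut at a threshold before they are compared with `y_test`. The sketch below assumes the batch transform output contains one numeric score per line; the helper name `scores_to_labels` and the sample scores are made up for illustration.

```python
def scores_to_labels(raw_output, threshold=0.5):
    """Convert whitespace-separated scores into 0/1 labels using a threshold."""
    scores = [float(token) for token in raw_output.split()]
    return [1 if score > threshold else 0 for score in scores]


# Made-up scores standing in for the contents of a transform output file
raw_output = "0.91\n0.12\n0.55\n0.49\n"

labels = scores_to_labels(raw_output)
# Truncating with int() instead of thresholding maps every score below 1.0 to 0
truncated = [int(float(token)) for token in raw_output.split()]

print("thresholded:", labels)
print("truncated:  ", truncated)
```

The 0.5 cut-off is only a default. If missing a delay is more costly than a false alarm, a lower threshold may be a better trade-off.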


# Step 2: Build and evaluate simple models

Write code to perform the following steps:
1. Split data into training, validation and testing sets (70% - 15% - 15%).
2. Use the linear learner estimator to build a classification model.
3. Host the model on another instance.
4. Perform a batch transform to evaluate the model on the testing data.
5. Report the performance metrics that you think best test the model performance.

Note: You are required to perform the above steps on the two combined datasets separately and to comment on the difference.
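
The 70% - 15% - 15% split in step 1 is obtained in two stages: hold out 30% of the rows, then divide that holdout in half. The sketch below reproduces only the index arithmetic with the standard library, to show why the two stages give the stated proportions. It is not scikit-learn's implementation, and the function name `three_way_split` is hypothetical.

```python
import random


def three_way_split(n_rows, holdout=0.3, test_share_of_holdout=0.5, seed=42):
    """Return shuffled (train, validation, test) index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    n_holdout = round(n_rows * holdout)                # 30% leaves the training set
    n_test = round(n_holdout * test_share_of_holdout)  # half of the holdout is test
    n_train = n_rows - n_holdout

    train = indices[:n_train]
    validation = indices[n_train:n_rows - n_test]
    test = indices[n_rows - n_test:]
    return train, validation, test


train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

If delayed flights turn out to be much rarer than on-time flights, passing `stratify=y` to `train_test_split` would keep the class ratio the same in all three sets.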

In [None]:
# Define parameters to train a 'linear-learner'
dataset = df_1
estimator_name = 'linear-learner'
s3_prefix = f'flight-delay-prediction-{estimator_name}-dataset1'
train_deploy_test(dataset,estimator_name,s3_bucket,s3_prefix)

Start train_deploy_test for linear-learner and store at s3://c94466a2114432l5153128t1w680590409226-labbucket-1igwq1eolbjw9/flight-delay-prediction-linear-learner-dataset1


INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.


Setting params for linear-learner
Start fitting : linear-learner


INFO:sagemaker:Creating training-job with name: linear-learner-2023-11-02-14-06-00-515


2023-11-02 14:06:00 Starting - Starting the training job.....

In [None]:
# Define parameters to train a 'linear-learner'
dataset = df_2
estimator_name = 'linear-learner'
s3_prefix = f'flight-delay-prediction-{estimator_name}-dataset2'
train_deploy_test(dataset,estimator_name,s3_bucket,s3_prefix)

# Step 3: Build and evaluate ensemble models

Write code to perform the following steps:
1. Split data into training, validation and testing sets (70% - 15% - 15%).
2. Use the xgboost estimator to build a classification model.
3. Host the model on another instance.
4. Perform a batch transform to evaluate the model on the testing data.
5. Report the performance metrics that you think best test the model performance.
6. Write down your observations on the difference between the performance of the simple and ensemble models.

Note: You are required to perform the above steps on the two combined datasets separately.
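
When comparing the simple and ensemble models, accuracy alone can be misleading if delays are the minority class, so precision, recall, and F1 for the "Delay" class are worth reporting side by side. The sketch below derives them from the four confusion matrix counts. The counts in the demo are placeholders chosen for illustration, not results from this notebook, and the helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# For labels [0, 1], confusion_matrix(y_true, y_pred).ravel() yields tn, fp, fn, tp.
# Placeholder counts: a model that always predicts "No Delay" ...
always_no_delay = binary_metrics(tn=80, fp=0, fn=20, tp=0)
# ... versus a model that catches some of the delays
catches_delays = binary_metrics(tn=70, fp=10, fn=8, tp=12)

print(always_no_delay)
print(catches_delays)
```

In this made-up case the two models have similar accuracy, but only the second one has any recall on delayed flights, which is the quantity the business scenario cares about.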

In [None]:
dataset = df_1
estimator_name = 'xgboost' 
s3_prefix = f'flight-delay-prediction-{estimator_name}-dataset1'
train_deploy_test(dataset,estimator_name,s3_bucket,s3_prefix)

In [None]:
dataset = df_2
estimator_name = 'xgboost' 
s3_prefix = f'flight-delay-prediction-{estimator_name}-dataset2'
train_deploy_test(dataset,estimator_name,s3_bucket,s3_prefix)

In [None]:
# Write the final comments here and turn the cell type into markdown