## Data Science Technology and Systems

### Master of Data Science
### U3241627 - Bharath Shivakumar

# Problem: Predicting Airplane Delays

The goals of this notebook are:
- Process and create a dataset from downloaded ZIP files
- Exploratory data analysis (EDA)
- Establish a baseline model and improve it

## Introduction to business scenario
You work for a travel booking website that is working to improve the customer experience for flights that were delayed. The company wants to create a feature to let customers know if the flight will be delayed due to weather when the customers are booking the flight to or from the busiest airports for domestic travel in the US. 

You are tasked with solving part of this problem by using machine learning to identify whether a flight will be delayed due to weather. You have been given access to a dataset of on-time performance of domestic flights operated by large air carriers. You can use this data to train a machine learning model that predicts whether a flight to or from one of the busiest airports will be delayed.

### Dataset
The provided dataset contains scheduled and actual departure and arrival times reported by certified US air carriers that account for at least 1 percent of domestic scheduled passenger revenues. The data was collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). The dataset contains the date, time, origin, destination, airline, distance, and delay status of flights between 2014 and 2018.
The data are in 60 compressed files, each containing a CSV with the flight details for one month of the five years (2014 to 2018). The data can be downloaded from this link: [https://ucstaff-my.sharepoint.com/:f:/g/personal/ibrahim_radwan_canberra_edu_au/Er0nVreXmihEmtMz5qC5kVIB81-ugSusExPYdcyQTglfLg?e=bNO312]. Please download the data files and place them on a relative path. Dataset(s) used in this assignment were compiled by the Office of Airline Information, Bureau of Transportation Statistics (BTS), Airline On-Time Performance Data, available at the following link: [https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ]. 

# Step 1: Prepare the environment 

Use one of the Amazon SageMaker labs that we have practised with, and perform the following steps:
1. Start a lab.
2. Create a notebook instance and name it "oncloudproject".
3. Increase the used memory to 25 GB from the additional configurations.
4. Open Jupyter Lab and upload this notebook into it.
5. Upload the two combined CSV files (combined_csv_v1.csv and combined_csv_v2.csv), which you created in Part A of this project.

# Step 2: Build and evaluate simple models

Write code to perform the following steps:
1. Split data into training, validation and testing sets (70% - 15% - 15%).
2. Use the linear learner estimator to build a classification model.
3. Host the model on another instance.
4. Perform batch transform to evaluate the model on testing data.
5. Report the performance metrics that you think best assess the model's performance.

Note: You are required to perform the above steps on the two combined datasets separately and to comment on the difference.
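
The 70% - 15% - 15% split in step 1 is usually done in two stages: hold out 30% of the rows, then cut that hold-out in half. The sketch below reproduces only the arithmetic with the standard library. The helper name `two_stage_split` and the 1000-row toy list are assumptions for illustration and are not part of the assignment.

```python
import random


def two_stage_split(rows, seed=42):
    """Split rows 70/15/15 by holding out 30%, then halving the hold-out."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)   # reproducible shuffle
    n_train = round(len(shuffled) * 0.7)    # stage 1: 70% for training
    holdout = shuffled[n_train:]            # remaining 30%
    n_val = len(holdout) // 2               # stage 2: half of 30% = 15%
    return shuffled[:n_train], holdout[:n_val], holdout[n_val:]


train, validation, test = two_stage_split(list(range(1000)))
print(len(train), len(validation), len(test))
```

Fixing the seed keeps the three sets identical across reruns, which matters when two models are compared on the same test rows.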

In [None]:
import io
import os
import re

import boto3
import pandas as pd
import sagemaker
from sagemaker import get_execution_role
from sagemaker.image_uris import retrieve
from sklearn.model_selection import train_test_split

In [None]:
role = get_execution_role()

In [None]:
#We load the data
data = pd.read_csv('combined_csv_v1.csv')

In [None]:
#We check the shape of the data
data.shape

In [None]:
#We check to see if there are any NA values in the data
data.isna().sum()

In [None]:
#We do see some negligible amount of NA values, let's drop them
data = data.dropna()

In [None]:
#We check the NA values again
data.isna().sum()

### Split data into training, validation and testing sets (70% - 15% - 15%).

In [None]:
#We split the data in train, test and validation accordingly.
train_data, test_data = train_test_split(data, test_size = 0.3, random_state= 42)
validation_data, test_data = train_test_split(test_data, test_size = 0.5, random_state = 42)

In [None]:
#Now let us understand these 3 variables
print(train_data.shape)
print(test_data.shape)
print(validation_data.shape)

In [None]:
#We save the data
train_data.to_csv("train_data_v1.csv", index = False, header = False)
validation_data.to_csv("validation_data_v1.csv", index = False, header = False)
test_data_val = test_data.drop('target', axis = 1)

In [None]:
#We create the sagemaker session and the region
sess_sage = sagemaker.Session()
region = boto3.Session().region_name

##We create a default bucket
bucket = sess_sage.default_bucket()
prefix = "sagemaker/oncloud"

#We define the files that are going to be uploaded
train_file='train_data_v1.csv'
test_file='test_data_v1.csv'
validate_file='validate_data_v1.csv'

#We create a function to upload files to s3 bucket
s3_resource = boto3.Session().resource('s3')
def upload_s3_csv(filename, folder, dataframe):
    csv_buffer = io.StringIO()
    dataframe.to_csv(csv_buffer, header=False, index=False )
    s3_resource.Bucket(bucket).Object(os.path.join(prefix, folder, filename)).put(Body=csv_buffer.getvalue())

#We upload the files
upload_s3_csv(train_file, 'train', train_data)
upload_s3_csv(test_file, 'test', test_data)
upload_s3_csv(validate_file, 'validate', validation_data)

### Use linear learner estimator to build a classification model.

In [None]:
#Importing libraries
from sagemaker.estimator import Estimator

#We setup a container
container = sagemaker.image_uris.retrieve("linear-learner", region)

#We define the classifier
classifier1 = Estimator(
    container,
    role=role,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    output_path=f's3://{bucket}/{prefix}/output'
)

In [None]:
#Importing libraries
from sagemaker import TrainingInput
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer

#We define the hyperparameter
hyperparameters = {
    "predictor_type": "binary_classifier",
    "mini_batch_size": 100,
    "epochs": 3
}


#We define the input for the classifier, pointing each channel at the exact file we uploaded
data_train = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/train/{}".format(bucket,prefix,train_file),
    content_type='text/csv')

data_validation = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/validate/{}".format(bucket,prefix,validate_file),
    content_type='text/csv')

data_channels = {
    "train": data_train,
    "validation": data_validation
}
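
A channel URI that ends in a folder makes the training job read every object stored under that prefix, while a URI that ends in a file name pins a single object. The sketch below shows the difference in plain string terms. The bucket name `example-bucket` is hypothetical, and `posixpath.join` stands in for the key construction used by the upload helper.

```python
import posixpath

bucket = "example-bucket"          # hypothetical bucket name
prefix = "sagemaker/oncloud"
train_file = "train_data_v1.csv"

# Two placeholders, three arguments: str.format silently ignores the extra one,
# so this URI names the whole train/ folder.
folder_uri = "s3://{}/{}/train/".format(bucket, prefix, train_file)

# Three placeholders: the URI names exactly one object.
file_uri = "s3://{}/{}/train/{}".format(bucket, prefix, train_file)

# Object key assembled from prefix, folder and file name.
key = posixpath.join(prefix, "train", train_file)

print(folder_uri)
print(file_uri)
print(f"s3://{bucket}/{key}" == file_uri)
```

Pointing at the file is the safer choice when several dataset versions may end up under the same folder.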

In [None]:
#We fit the model
classifier1.set_hyperparameters(**hyperparameters)
classifier1.fit(inputs = data_channels)

### Host the model on another instance

In [None]:
from time import gmtime, strftime

#Deploying the model
time_stamp = strftime('%d-%H-%M-%S', gmtime())
endpoint_name = f'linear-learner-demo-{time_stamp}'
print(endpoint_name)
predictor = classifier1.deploy(endpoint_name=endpoint_name, initial_instance_count=1, instance_type='ml.m4.xlarge', 
                               serializer=sagemaker.serializers.CSVSerializer())

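In [None]:
#We send a few test rows (features only, without the target column) to the endpoint
result = predictor.predict(test_data.drop(columns=['target']).head(5).values)
print(result)
#We delete the endpoint once we are done with it to avoid ongoing charges
predictor.delete_endpoint()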

### Perform batch transform to evaluate the model on testing data

In [None]:
# Load the test dataset without the target column
batch_test = test_data.drop(columns=["target"])

# Save the modified dataset
batch_test.to_csv('batch-in.csv', index=False, header=False)

# Upload the dataset to S3 for batch processing
s3_batch_test_path = sess_sage.upload_data(path='batch-in.csv', key_prefix=f'{prefix}/input/testing')
print(s3_batch_test_path)

# Define the batch output path in S3
s3_batch_output_path = f's3://{bucket}/{prefix}/batch-out/'
print(s3_batch_output_path)

In [None]:
# We create a transformer object from the trained model for batch processing
linearmodel_transform = classifier1.transformer(instance_count=1,
                                   instance_type='ml.m4.xlarge',
                                   strategy='MultiRecord',
                                   assemble_with='Line',
                                   output_path=s3_batch_output_path)

# Start the batch transform job
linearmodel_transform.transform(data=s3_batch_test_path,
                         data_type='S3Prefix',
                         content_type='text/csv',
                         split_type='Line')

# Wait for the batch transform job to complete
linearmodel_transform.wait()

In [None]:
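#We create the confusion matrix

#We import the libraries
from sklearn.metrics import confusion_matrix

#We download the batch transform output from S3
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket, Key=f'{prefix}/batch-out/batch-in.csv.out')

#Assumption: each output line is JSON such as {"predicted_label": 1, "score": 0.7}
target_predicted = pd.read_json(io.BytesIO(obj['Body'].read()), lines=True)
target_predicted_binary = target_predicted['predicted_label'].astype(int)
test_labels = test_data['target']

#We create the confusion matrix (rows are true classes, columns are predicted classes, class 0 comes first)
matrix = confusion_matrix(test_labels, target_predicted_binary)
confusion_mat = pd.DataFrame(matrix, index=['Not_Delayed','Delayed'],columns=['Not_Delayed','Delayed'])

confusion_mat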
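
For 0/1 labels, scikit-learn's `confusion_matrix` sorts the classes, so class 0 occupies the first row and column and the flattened order is TN, FP, FN, TP. The sketch below rebuilds that layout by hand to make the cell positions explicit. It assumes that label 1 means "delayed", and the eight label pairs are invented for illustration only.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for 0/1 labels (rows: true, columns: predicted)."""
    table = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        table[actual][predicted] += 1
    return table


# Illustrative labels only, not taken from the flight data.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

table = confusion_counts(y_true, y_pred)
(tn, fp), (fn, tp) = table
print(table)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

Row and column labels attached to the matrix should follow the same order, with the "not delayed" class first.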

In [None]:
#We plot the confusion matrix
import seaborn as sns
import matplotlib.pyplot as plt

colormap = sns.color_palette("BrBG", 10)
sns.heatmap(confusion_mat, annot=True, fmt='d', cbar=False, cmap=colormap)
plt.title("Confusion Matrix")
plt.tight_layout()
plt.ylabel("True Class")
plt.xlabel("Predicted Class")
plt.show()

### Report the performance metrics that best assess the model's performance

In [None]:
#To start, extract the values from the confusion matrix cells into variables.
from sklearn.metrics import roc_auc_score, roc_curve, auc

TN, FP, FN, TP = confusion_matrix(test_labels, target_predicted_binary).ravel()

print(f"True Negative (TN) : {TN}")
print(f"False Positive (FP): {FP}")
print(f"False Negative (FN): {FN}")
print(f"True Positive (TP) : {TP}")

In [None]:
# Sensitivity, hit rate, recall, or true positive rate
Sensitivity  = float(TP)/(TP+FN)*100
print(f"Sensitivity or TPR: {Sensitivity}%")
print(f"The model flags {Sensitivity}% of the flights that were actually delayed.")

# Specificity or true negative rate
Specificity  = float(TN)/(TN+FP)*100
print(f"Specificity or TNR: {Specificity}%")
print(f"The model correctly clears {Specificity}% of the flights that were not delayed.")

In [None]:
# Precision or positive predictive value
Precision = float(TP)/(TP+FP)*100
print(f"Precision: {Precision}%")
print(f"When the model predicts a delay, it is correct {Precision}% of the time.")

# Negative predictive value
NPV = float(TN)/(TN+FN)*100
print(f"Negative Predictive Value: {NPV}%")
print(f"When the model predicts no delay, it is correct {NPV}% of the time.")

# Fall out or false positive rate
FPR = float(FP)/(FP+TN)*100
print(f"False Positive Rate: {FPR}%")
print(f"{FPR}% of the flights that were not delayed are wrongly flagged as delayed.")

# False negative rate
FNR = float(FN)/(TP+FN)*100
print(f"False Negative Rate: {FNR}%")
print(f"{FNR}% of the flights that were delayed are missed by the model.")

# False discovery rate
FDR = float(FP)/(TP+FP)*100
print(f"False Discovery Rate: {FDR}%")
print(f"When the model predicts a delay, it is wrong {FDR}% of the time.")

In [None]:
# Overall accuracy
ACC = float(TP+TN)/(TP+FP+FN+TN)*100
print(f"Accuracy: {ACC}%") 

In [None]:
#Giving a summary
print(f"Sensitivity or TPR: {Sensitivity}%")    
print(f"Specificity or TNR: {Specificity}%") 
print(f"Precision: {Precision}%")   
print(f"Negative Predictive Value: {NPV}%")  
print( f"False Positive Rate: {FPR}%") 
print(f"False Negative Rate: {FNR}%")  
print(f"False Discovery Rate: {FDR}%" )
print(f"Accuracy: {ACC}%") 
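
The metric cells above can be condensed into one helper that takes the four confusion-matrix counts and returns every rate as a percentage. This is a minimal sketch: the function name `classification_metrics` and the counts passed to it are assumptions for illustration and are not results from this notebook.

```python
def classification_metrics(tn, fp, fn, tp):
    """Return the usual confusion-matrix rates as percentages."""
    def pct(numerator, denominator):
        # Guard against empty classes instead of raising ZeroDivisionError.
        return 100.0 * numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": pct(tp, tp + fn),   # share of delayed flights detected
        "specificity": pct(tn, tn + fp),   # share of on-time flights cleared
        "precision": pct(tp, tp + fp),     # delay predictions that are right
        "npv": pct(tn, tn + fn),           # no-delay predictions that are right
        "fpr": pct(fp, fp + tn),           # on-time flights wrongly flagged
        "fnr": pct(fn, fn + tp),           # delayed flights missed
        "fdr": pct(fp, fp + tp),           # delay predictions that are wrong
        "accuracy": pct(tp + tn, tp + tn + fp + fn),
    }


# Illustrative counts only.
metrics = classification_metrics(tn=50, fp=10, fn=5, tp=35)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}%")
```

Returning a dictionary makes it straightforward to reuse the same computation for each dataset and each model.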

### This code is for the second dataset.

In [None]:
role = get_execution_role()

In [None]:
#We load the data
data = pd.read_csv('combined_csv_v2.csv')

In [None]:
#We check to see if there are any NA values in the data
data.isna().sum()

In [None]:
#We do see some negligible amount of NA values, let's drop them
data = data.dropna()

### Split data into training, validation and testing sets (70% - 15% - 15%).

In [None]:
#We split the data in train, test and validation accordingly.
train_data, test_data = train_test_split(data, test_size = 0.3, random_state= 42)
validation_data, test_data = train_test_split(test_data, test_size = 0.5, random_state = 42)

In [None]:
#Now let us understand these 3 variables
print(train_data.shape)
print(test_data.shape)
print(validation_data.shape)

In [None]:
#We save the data
train_data.to_csv("train_data_v2.csv", index = False, header = False)
validation_data.to_csv("validation_data_v2.csv", index = False, header = False)
test_data_val = test_data.drop('target', axis = 1)

In [None]:
#We create the sagemaker session and the region
sess_sage = sagemaker.Session()
region = boto3.Session().region_name

##We create a default bucket and use a separate prefix so the v2 files do not mix with the v1 files
bucket = sess_sage.default_bucket()
prefix = "sagemaker/oncloud-v2"

#We define the files that are going to be uploaded
train_file='train_data_v2.csv'
test_file='test_data_v2.csv'
validate_file='validate_data_v2.csv'

#We create a function to upload files to s3 bucket
s3_resource = boto3.Session().resource('s3')
def upload_s3_csv(filename, folder, dataframe):
    csv_buffer = io.StringIO()
    dataframe.to_csv(csv_buffer, header=False, index=False )
    s3_resource.Bucket(bucket).Object(os.path.join(prefix, folder, filename)).put(Body=csv_buffer.getvalue())

#We upload the files
upload_s3_csv(train_file, 'train', train_data)
upload_s3_csv(test_file, 'test', test_data)
upload_s3_csv(validate_file, 'validate', validation_data)

### Use linear learner estimator to build a classification model.

In [None]:
#Importing libraries
from sagemaker.estimator import Estimator

#We setup a container
container = sagemaker.image_uris.retrieve("linear-learner", region)

#We define the classifier
classifier1 = Estimator(
    container,
    role=role,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    output_path=f's3://{bucket}/{prefix}/output'
)

In [None]:
#Importing libraries
from sagemaker import TrainingInput
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer

#We define the hyperparameter
hyperparameters = {
    "predictor_type": "binary_classifier",
    "mini_batch_size": 100,
    "epochs": 3
}


#We define the input for the classifier, pointing each channel at the exact file we uploaded
data_train = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/train/{}".format(bucket,prefix,train_file),
    content_type='text/csv')

data_validation = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/validate/{}".format(bucket,prefix,validate_file),
    content_type='text/csv')

data_channels = {
    "train": data_train,
    "validation": data_validation
}

In [None]:
#We fit the model
classifier1.set_hyperparameters(**hyperparameters)
classifier1.fit(inputs = data_channels)

### Host the model on another instance

In [None]:
from time import gmtime, strftime

#Deploying the model
time_stamp = strftime('%d-%H-%M-%S', gmtime())
endpoint_name = f'linear-learner-demo-{time_stamp}'
print(endpoint_name)
predictor = classifier1.deploy(endpoint_name=endpoint_name, initial_instance_count=1, instance_type='ml.m4.xlarge', 
                               serializer=sagemaker.serializers.CSVSerializer())

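In [None]:
#We send a few test rows (features only, without the target column) to the endpoint
result = predictor.predict(test_data.drop(columns=['target']).head(5).values)
print(result)
#We delete the endpoint once we are done with it to avoid ongoing charges
predictor.delete_endpoint()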

### Perform batch transform to evaluate the model on testing data

In [None]:
# Load the test dataset without the target column
batch_test = test_data.drop(columns=["target"])

# Save the modified dataset
batch_test.to_csv('batch-in.csv', index=False, header=False)

# Upload the dataset to S3 for batch processing
s3_batch_test_path = sess_sage.upload_data(path='batch-in.csv', key_prefix=f'{prefix}/input/testing')
print(s3_batch_test_path)

# Define the batch output path in S3
s3_batch_output_path = f's3://{bucket}/{prefix}/batch-out/'
print(s3_batch_output_path)

In [None]:
# We create a transformer object from the trained model for batch processing
linearmodel_transform = classifier1.transformer(instance_count=1,
                                   instance_type='ml.m4.xlarge',
                                   strategy='MultiRecord',
                                   assemble_with='Line',
                                   output_path=s3_batch_output_path)

# Start the batch transform job
linearmodel_transform.transform(data=s3_batch_test_path,
                         data_type='S3Prefix',
                         content_type='text/csv',
                         split_type='Line')

# Wait for the batch transform job to complete
linearmodel_transform.wait()

In [None]:
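#We create the confusion matrix

#We import the libraries
from sklearn.metrics import confusion_matrix

#We download the batch transform output from S3
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket, Key=f'{prefix}/batch-out/batch-in.csv.out')

#Assumption: each output line is JSON such as {"predicted_label": 1, "score": 0.7}
target_predicted = pd.read_json(io.BytesIO(obj['Body'].read()), lines=True)
target_predicted_binary = target_predicted['predicted_label'].astype(int)
test_labels = test_data['target']

#We create the confusion matrix (rows are true classes, columns are predicted classes, class 0 comes first)
matrix = confusion_matrix(test_labels, target_predicted_binary)
confusion_mat = pd.DataFrame(matrix, index=['Not_Delayed','Delayed'],columns=['Not_Delayed','Delayed'])

confusion_mat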

In [None]:
#We plot the confusion matrix
import seaborn as sns
import matplotlib.pyplot as plt

colormap = sns.color_palette("BrBG", 10)
sns.heatmap(confusion_mat, annot=True, fmt='d', cbar=False, cmap=colormap)
plt.title("Confusion Matrix")
plt.tight_layout()
plt.ylabel("True Class")
plt.xlabel("Predicted Class")
plt.show()

### Report the performance metrics that best assess the model's performance

In [None]:
#To start, extract the values from the confusion matrix cells into variables.
from sklearn.metrics import roc_auc_score, roc_curve, auc

TN, FP, FN, TP = confusion_matrix(test_labels, target_predicted_binary).ravel()

print(f"True Negative (TN) : {TN}")
print(f"False Positive (FP): {FP}")
print(f"False Negative (FN): {FN}")
print(f"True Positive (TP) : {TP}")

In [None]:
# Sensitivity, hit rate, recall, or true positive rate
Sensitivity  = float(TP)/(TP+FN)*100
print(f"Sensitivity or TPR: {Sensitivity}%")
print(f"The model flags {Sensitivity}% of the flights that were actually delayed.")

# Specificity or true negative rate
Specificity  = float(TN)/(TN+FP)*100
print(f"Specificity or TNR: {Specificity}%")
print(f"The model correctly clears {Specificity}% of the flights that were not delayed.")

In [None]:
# Precision or positive predictive value
Precision = float(TP)/(TP+FP)*100
print(f"Precision: {Precision}%")
print(f"When the model predicts a delay, it is correct {Precision}% of the time.")

# Negative predictive value
NPV = float(TN)/(TN+FN)*100
print(f"Negative Predictive Value: {NPV}%")
print(f"When the model predicts no delay, it is correct {NPV}% of the time.")

# Fall out or false positive rate
FPR = float(FP)/(FP+TN)*100
print(f"False Positive Rate: {FPR}%")
print(f"{FPR}% of the flights that were not delayed are wrongly flagged as delayed.")

# False negative rate
FNR = float(FN)/(TP+FN)*100
print(f"False Negative Rate: {FNR}%")
print(f"{FNR}% of the flights that were delayed are missed by the model.")

# False discovery rate
FDR = float(FP)/(TP+FP)*100
print(f"False Discovery Rate: {FDR}%")
print(f"When the model predicts a delay, it is wrong {FDR}% of the time.")

In [None]:
# Overall accuracy
ACC = float(TP+TN)/(TP+FP+FN+TN)*100
print(f"Accuracy: {ACC}%") 

In [None]:
#Giving a summary
print(f"Sensitivity or TPR: {Sensitivity}%")    
print(f"Specificity or TNR: {Specificity}%") 
print(f"Precision: {Precision}%")   
print(f"Negative Predictive Value: {NPV}%")  
print( f"False Positive Rate: {FPR}%") 
print(f"False Negative Rate: {FNR}%")  
print(f"False Discovery Rate: {FDR}%" )
print(f"Accuracy: {ACC}%") 

In [None]:
# Write the final comments here and turn the cell type into markdown

# Step 3: Build and evaluate ensemble models

Write code to perform the following steps:
1. Split data into training, validation and testing sets (70% - 15% - 15%).
2. Use the xgboost estimator to build a classification model.
3. Host the model on another instance.
4. Perform batch transform to evaluate the model on testing data.
5. Report the performance metrics that you think best assess the model's performance.
6. Write down your observations on the difference in performance between the simple and ensemble models.

Note: You are required to perform the above steps on the two combined datasets separately.
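
For the comparison in step 6, it can help to put the two models' metrics side by side and look at the differences in percentage points. The sketch below shows one way to do that. The helper name `metric_deltas` is hypothetical, and the numbers are made-up placeholders, not results from this notebook; they should be replaced with the values measured for each model.

```python
def metric_deltas(baseline, candidate):
    """Return candidate minus baseline for every metric both models report."""
    return {name: round(candidate[name] - baseline[name], 2)
            for name in baseline if name in candidate}


# Made-up placeholder values, NOT results from this notebook.
linear_metrics = {"sensitivity": 10.0, "specificity": 95.0, "accuracy": 79.0}
xgboost_metrics = {"sensitivity": 14.5, "specificity": 96.0, "accuracy": 80.5}

deltas = metric_deltas(linear_metrics, xgboost_metrics)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f} percentage points")
```

A positive difference means the ensemble model scored higher on that metric than the linear baseline.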

In [None]:
#We load the data
data = pd.read_csv('combined_csv_v1.csv')

In [None]:
#As we saw earlier, the data has some NA values, so we drop them directly.
data = data.dropna()

### Split data into training, validation and testing sets (70% - 15% - 15%).

In [None]:
#We split the data in train, test and validation accordingly.
train_data, test_data = train_test_split(data, test_size = 0.3, random_state= 42)
validation_data, test_data = train_test_split(test_data, test_size = 0.5, random_state = 42)

In [None]:
#Now let us understand these 3 variables
print(train_data.shape)
print(test_data.shape)
print(validation_data.shape)

In [None]:
#We create the sagemaker session and the region
sess_sage = sagemaker.Session()
region = boto3.Session().region_name

##We create a default bucket
bucket = sess_sage.default_bucket()
prefix = "sagemaker/oncloud"

#We define the files that are going to be uploaded
train_file='train_data_v1.csv'
test_file='test_data_v1.csv'
validate_file='validate_data_v1.csv'

#We create a function to upload files to s3 bucket
s3_resource = boto3.Session().resource('s3')
def upload_s3_csv(filename, folder, dataframe):
    csv_buffer = io.StringIO()
    dataframe.to_csv(csv_buffer, header=False, index=False )
    s3_resource.Bucket(bucket).Object(os.path.join(prefix, folder, filename)).put(Body=csv_buffer.getvalue())

In [None]:
#We upload the files
upload_s3_csv(train_file, 'train', train_data)
upload_s3_csv(test_file, 'test', test_data)
upload_s3_csv(validate_file, 'validate', validation_data)

### Use xgboost estimator to build a classification model.

In [None]:
#We import necessary libraries
import boto3
from sagemaker.image_uris import retrieve

#We create a container (SageMaker Python SDK v2 call, matching the linear learner section)
container = retrieve('xgboost', region, version='1.0-1')

In [None]:
#We import necessary libraries
import sagemaker
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer


#We set an output location
s3_output_location="s3://{}/{}/output/".format(bucket,prefix)

#We define our xg boost model
xgboost_model=sagemaker.estimator.Estimator(container, 
                                  role=role,
                                  instance_count=1,
                                  instance_type='ml.m4.xlarge',
                                  output_path='s3://{}/{}/output'.format(bucket, prefix))

#multi:softmax with num_class=2 returns the predicted class (0 or 1) for each row
xgboost_model.set_hyperparameters(objective='multi:softmax',
                                  num_class=2,
                                  num_round=10,
                                  early_stopping_rounds=5)


#We define the input for the classifier, pointing each channel at the exact file we uploaded
data_train = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/train/{}".format(bucket,prefix,train_file),
    content_type='text/csv')

data_validation = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/validate/{}".format(bucket,prefix,validate_file),
    content_type='text/csv')

data_channels = {
    "train": data_train,
    "validation": data_validation
}


In [None]:
#We fit the model
xgboost_model.fit(inputs=data_channels, logs=False)

### Host the model on another instance

In [None]:
#We deploy the model
xgboost_predictor = xgboost_model.deploy(initial_instance_count=1, serializer = sagemaker.serializers.CSVSerializer(), 
                                 instance_type='ml.m4.xlarge')

In [None]:
row = test_data.iloc[0:1,1:]
row.head()

batch_buffer = io.StringIO()
row.to_csv(batch_buffer, header=False, index=False)
test_row = batch_buffer.getvalue()
print(test_row)

In [None]:
#We predict the test values
xgboost_predictor.predict(test_row)

In [None]:
#We extract the test data again
batch_X = test_data.iloc[:,1:];
batch_X.head()

In [None]:
#We need to delete the end point and this is an important step
xgboost_predictor.delete_endpoint(delete_endpoint_config=True)

### Perform batch transform to evaluate the model on testing data

In [None]:
#We conduct batch processing
batch_X_file='batch-in.csv'
upload_s3_csv(batch_X_file, 'batch-in', batch_X)

batch_output = "s3://{}/{}/batch-out/".format(bucket,prefix)
batch_input = "s3://{}/{}/batch-in/{}".format(bucket,prefix,batch_X_file)

xgboost_transformer = xgboost_model.transformer(instance_count=1,
                                       instance_type='ml.m4.xlarge',
                                       strategy='MultiRecord',
                                       assemble_with='Line',
                                       output_path=batch_output)

xgboost_transformer.transform(data=batch_input,
                         data_type='S3Prefix',
                         content_type='text/csv',
                         split_type='Line')
xgboost_transformer.wait()

In [None]:
#We download the results from S3
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket, Key="{}/batch-out/{}".format(prefix,'batch-in.csv.out'))
target_predicted = pd.read_csv(io.BytesIO(obj['Body'].read()),sep=',',names=['class'])
target_predicted.head(5)

In [None]:
def binary_convert(x):
    threshold = 0.65
    if x > threshold:
        return 1
    else:
        return 0

#The predictions were loaded into the 'class' column
target_predicted['binary'] = target_predicted['class'].apply(binary_convert)

print(target_predicted.head(10))
test_data.head(10)

test_labels = test_data.iloc[:,0]
test_labels.head()
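
The threshold step can be written as a small standalone function to see how the cut-off behaves at the boundary. This is a sketch with invented scores; the function name `to_binary` is an assumption. Because the comparison is a strict `>`, a score exactly equal to the threshold maps to 0, and predictions that are already hard 0/1 labels pass through unchanged for any threshold between 0 and 1.

```python
def to_binary(scores, threshold=0.65):
    """Map each score to 1 if it is strictly above the threshold, else 0."""
    return [1 if score > threshold else 0 for score in scores]


# Illustrative scores only, not model output.
scores = [0.10, 0.65, 0.66, 0.90, 1.0, 0.0]

print(to_binary(scores))
print(to_binary(scores, threshold=0.5))
```

Lowering the threshold flags more flights as delayed, which raises sensitivity at the cost of more false positives.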

In [None]:
#We create the confusion matrix

#We import the libraries
from sklearn.metrics import confusion_matrix

#We create the confusion matrix (class 0, not delayed, comes first in rows and columns)
matrix = confusion_matrix(test_labels, target_predicted['binary'])
confusion_mat = pd.DataFrame(matrix, index=['Not_Delayed','Delayed'],columns=['Not_Delayed','Delayed'])

confusion_mat

In [None]:
#We import the libraries
import seaborn as sns
import matplotlib.pyplot as plt

#We plot the confusion matrix
colormap = sns.color_palette("viridis", 10)
sns.heatmap(confusion_mat, annot=True, cbar=None, cmap=colormap)
plt.title("Confusion Matrix")
plt.tight_layout()
plt.ylabel("True Class")
plt.xlabel("Predicted Class")
plt.show()

### Report the performance metrics that best assess the model's performance

In [None]:
#To start, extract the values from the confusion matrix cells into variables.
from sklearn.metrics import roc_auc_score, roc_curve, auc

TN, FP, FN, TP = confusion_matrix(test_labels, target_predicted['binary']).ravel()

print(f"True Negative (TN) : {TN}")
print(f"False Positive (FP): {FP}")
print(f"False Negative (FN): {FN}")
print(f"True Positive (TP) : {TP}")

In [None]:
# Sensitivity, hit rate, recall, or true positive rate
Sensitivity  = float(TP)/(TP+FN)*100
print(f"Sensitivity or TPR: {Sensitivity}%")
print(f"The model flags {Sensitivity}% of the flights that were actually delayed.")

# Specificity or true negative rate
Specificity  = float(TN)/(TN+FP)*100
print(f"Specificity or TNR: {Specificity}%")
print(f"The model correctly clears {Specificity}% of the flights that were not delayed.")

In [None]:
# Precision or positive predictive value
Precision = float(TP)/(TP+FP)*100
print(f"Precision: {Precision}%")
print(f"When the model predicts a delay, it is correct {Precision}% of the time.")

# Negative predictive value
NPV = float(TN)/(TN+FN)*100
print(f"Negative Predictive Value: {NPV}%")
print(f"When the model predicts no delay, it is correct {NPV}% of the time.")

# Fall out or false positive rate
FPR = float(FP)/(FP+TN)*100
print(f"False Positive Rate: {FPR}%")
print(f"{FPR}% of the flights that were not delayed are wrongly flagged as delayed.")

# False negative rate
FNR = float(FN)/(TP+FN)*100
print(f"False Negative Rate: {FNR}%")
print(f"{FNR}% of the flights that were delayed are missed by the model.")

# False discovery rate
FDR = float(FP)/(TP+FP)*100
print(f"False Discovery Rate: {FDR}%")
print(f"When the model predicts a delay, it is wrong {FDR}% of the time.")

In [None]:
# Overall accuracy
ACC = float(TP+TN)/(TP+FP+FN+TN)*100
print(f"Accuracy: {ACC}%") 

In [None]:
#Giving a summary
print(f"Sensitivity or TPR: {Sensitivity}%")    
print(f"Specificity or TNR: {Specificity}%") 
print(f"Precision: {Precision}%")   
print(f"Negative Predictive Value: {NPV}%")  
print( f"False Positive Rate: {FPR}%") 
print(f"False Negative Rate: {FNR}%")  
print(f"False Discovery Rate: {FDR}%" )
print(f"Accuracy: {ACC}%") 

### This is for the second dataset.

In [None]:
#We load the data
data = pd.read_csv('combined_csv_v2.csv')

In [None]:
#As we saw earlier, the data has some NA values, so we drop them directly.
data = data.dropna()

### Split data into training, validation and testing sets (70% - 15% - 15%).

In [None]:
#We split the data in train, test and validation accordingly.
train_data, test_data = train_test_split(data, test_size = 0.3, random_state= 42)
validation_data, test_data = train_test_split(test_data, test_size = 0.5, random_state = 42)

In [None]:
#Now let us understand these 3 variables
print(train_data.shape)
print(test_data.shape)
print(validation_data.shape)

In [None]:
#We create the sagemaker session and the region
sess_sage = sagemaker.Session()
region = boto3.Session().region_name

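##We create a default bucket and use a separate prefix so the v2 files do not mix with the v1 files
bucket = sess_sage.default_bucket()
prefix = "sagemaker/oncloud-v2"

#We define the files that are going to be uploaded
train_file='train_data_v2.csv'
test_file='test_data_v2.csv'
validate_file='validate_data_v2.csv'

#We create a function to upload files to s3 bucket
s3_resource = boto3.Session().resource('s3')
def upload_s3_csv(filename, folder, dataframe):
    csv_buffer = io.StringIO()
    dataframe.to_csv(csv_buffer, header=False, index=False )
    s3_resource.Bucket(bucket).Object(os.path.join(prefix, folder, filename)).put(Body=csv_buffer.getvalue())

In [None]:
#We upload the files
upload_s3_csv(train_file, 'train', train_data)
upload_s3_csv(test_file, 'test', test_data)
upload_s3_csv(validate_file, 'validate', validation_data)

#We point the training channels at the v2 files
data_train = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/train/{}".format(bucket,prefix,train_file),
    content_type='text/csv')
data_validation = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/validate/{}".format(bucket,prefix,validate_file),
    content_type='text/csv')
data_channels = {"train": data_train, "validation": data_validation}

In [None]:
#We fit the XGBoost estimator defined earlier on the v2 data
xgboost_model.fit(inputs=data_channels, logs=False)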

### Host the model on another instance

In [None]:
#We deploy the model
xgboost_predictor = xgboost_model.deploy(initial_instance_count=1, serializer = sagemaker.serializers.CSVSerializer(), 
                                 instance_type='ml.m4.xlarge')

In [None]:
row = test_data.iloc[0:1,1:]
row.head()

batch_buffer = io.StringIO()
row.to_csv(batch_buffer, header=False, index=False)
test_row = batch_buffer.getvalue()
print(test_row)

In [None]:
#We predict the test values
xgboost_predictor.predict(test_row)

In [None]:
#We extract the test data again
batch_X = test_data.iloc[:,1:];
batch_X.head()

In [None]:
#We need to delete the end point and this is an important step
xgboost_predictor.delete_endpoint(delete_endpoint_config=True)

### Perform batch transform to evaluate the model on testing data

In [None]:
#We conduct batch processing
batch_X_file='batch-in.csv'
upload_s3_csv(batch_X_file, 'batch-in', batch_X)

batch_output = "s3://{}/{}/batch-out/".format(bucket,prefix)
batch_input = "s3://{}/{}/batch-in/{}".format(bucket,prefix,batch_X_file)

xgboost_transformer = xgboost_model.transformer(instance_count=1,
                                       instance_type='ml.m4.xlarge',
                                       strategy='MultiRecord',
                                       assemble_with='Line',
                                       output_path=batch_output)

xgboost_transformer.transform(data=batch_input,
                         data_type='S3Prefix',
                         content_type='text/csv',
                         split_type='Line')
xgboost_transformer.wait()

In [None]:
#We download the results from S3
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket, Key="{}/batch-out/{}".format(prefix,'batch-in.csv.out'))
target_predicted = pd.read_csv(io.BytesIO(obj['Body'].read()),sep=',',names=['class'])
target_predicted.head(5)

In [None]:
def binary_convert(x):
    threshold = 0.65
    if x > threshold:
        return 1
    else:
        return 0

#The predictions were loaded into the 'class' column
target_predicted['binary'] = target_predicted['class'].apply(binary_convert)

print(target_predicted.head(10))
test_data.head(10)

test_labels = test_data.iloc[:,0]
test_labels.head()

In [None]:
#We create the confusion matrix

#We import the libraries
from sklearn.metrics import confusion_matrix

#We create the confusion matrix (class 0, not delayed, comes first in rows and columns)
matrix = confusion_matrix(test_labels, target_predicted['binary'])
confusion_mat = pd.DataFrame(matrix, index=['Not_Delayed','Delayed'],columns=['Not_Delayed','Delayed'])

confusion_mat

### Report the performance metrics that best assess the model's performance

In [None]:
#To start, extract the values from the confusion matrix cells into variables.
from sklearn.metrics import roc_auc_score, roc_curve, auc

TN, FP, FN, TP = confusion_matrix(test_labels, target_predicted['binary']).ravel()

print(f"True Negative (TN) : {TN}")
print(f"False Positive (FP): {FP}")
print(f"False Negative (FN): {FN}")
print(f"True Positive (TP) : {TP}")

In [None]:
# Sensitivity, hit rate, recall, or true positive rate
Sensitivity  = float(TP)/(TP+FN)*100
print(f"Sensitivity or TPR: {Sensitivity}%")
print(f"The model flags {Sensitivity}% of the flights that were actually delayed.")

# Specificity or true negative rate
Specificity  = float(TN)/(TN+FP)*100
print(f"Specificity or TNR: {Specificity}%")
print(f"The model correctly clears {Specificity}% of the flights that were not delayed.")

In [None]:
# Precision or positive predictive value
Precision = float(TP)/(TP+FP)*100
print(f"Precision: {Precision}%")
print(f"When the model predicts a delay, it is correct {Precision}% of the time.")

# Negative predictive value
NPV = float(TN)/(TN+FN)*100
print(f"Negative Predictive Value: {NPV}%")
print(f"When the model predicts no delay, it is correct {NPV}% of the time.")

# Fall out or false positive rate
FPR = float(FP)/(FP+TN)*100
print(f"False Positive Rate: {FPR}%")
print(f"{FPR}% of the flights that were not delayed are wrongly flagged as delayed.")

# False negative rate
FNR = float(FN)/(TP+FN)*100
print(f"False Negative Rate: {FNR}%")
print(f"{FNR}% of the flights that were delayed are missed by the model.")

# False discovery rate
FDR = float(FP)/(TP+FP)*100
print(f"False Discovery Rate: {FDR}%")
print(f"When the model predicts a delay, it is wrong {FDR}% of the time.")

In [None]:
# Overall accuracy
ACC = float(TP+TN)/(TP+FP+FN+TN)*100
print(f"Accuracy: {ACC}%") 

In [None]:
#Giving a summary
print(f"Sensitivity or TPR: {Sensitivity}%")    
print(f"Specificity or TNR: {Specificity}%") 
print(f"Precision: {Precision}%")   
print(f"Negative Predictive Value: {NPV}%")  
print( f"False Positive Rate: {FPR}%") 
print(f"False Negative Rate: {FNR}%")  
print(f"False Discovery Rate: {FDR}%" )
print(f"Accuracy: {ACC}%") 

In [None]:
# Write the final comments here and turn the cell type into markdown