Supervised Learning! 

**Problem:** Should a staff material request be sent to the technical approver or not? This project analyses historical material requests, and the features recorded for each one, to learn which requests were sent to the company's technical team.


XGBoost is used for this machine learning task for the following reasons:

1) it works well on small datasets
2) it requires relatively little tuning
3) it produces fairly good accuracy.

# Loading and Examining the data 

In [3]:
data_bucket = 'bucketname' # Defining the bucket name
subfolder = 'subfolder name' # Defining the sub-folder name
dataset = 'orders_with_predicted_value.csv' # Defining the dataset

In [4]:
import numpy as np
import pandas as pd
import boto3 # Used to interact with S3
import sagemaker # Used to interact with SageMaker
import s3fs # File-system-style interface to S3 (also lets pandas read s3:// paths)
from sklearn.model_selection import train_test_split 

In [5]:
role = sagemaker.get_execution_role()
s3 = s3fs.S3FileSystem(anon=False) # Connects to the S3 Storage

Viewing the dataset

In [7]:
df = pd.read_csv(f's3://{data_bucket}/{subfolder}/{dataset}')
df.head()

Unnamed: 0,tech_approval_required,requester_id,role,product,quantity,price,total
0,0,E2300,tech,Desk,1,664,664
1,0,E2300,tech,Keyboard,9,649,5841
2,0,E2374,non-tech,Keyboard,1,821,821
3,1,E2374,non-tech,Desktop Computer,24,655,15720
4,0,E2327,non-tech,Desk,1,758,758


In [8]:
df

Unnamed: 0,tech_approval_required,requester_id,role,product,quantity,price,total
0,0,E2300,tech,Desk,1,664,664
1,0,E2300,tech,Keyboard,9,649,5841
2,0,E2374,non-tech,Keyboard,1,821,821
3,1,E2374,non-tech,Desktop Computer,24,655,15720
4,0,E2327,non-tech,Desk,1,758,758
...,...,...,...,...,...,...,...
995,1,E2364,non-tech,Laptop Computer,1,116,116
996,1,E2357,non-tech,Laptop Computer,1,1132,1132
997,0,E2330,non-tech,Keyboard,2,804,1608
998,0,E2384,non-tech,Desk,3,270,810


Number of unique values in the 'tech_approval_required' column

In [9]:
df['tech_approval_required'].nunique()

2

Number of requests sent to the technical approver (1) versus not sent (0)

In [10]:
yes_and_no = df.tech_approval_required.value_counts()
pd.DataFrame(yes_and_no)

Unnamed: 0,tech_approval_required
0,807
1,193


Preparing the data

Converting categorical data to numeric

In [11]:
encoded_data = pd.get_dummies(df)
encoded_data.head()

Unnamed: 0,tech_approval_required,quantity,price,total,requester_id_E2300,requester_id_E2301,requester_id_E2302,requester_id_E2303,requester_id_E2304,requester_id_E2306,...,requester_id_E2400,role_non-tech,role_tech,product_Chair,product_Cleaning,product_Desk,product_Desktop Computer,product_Keyboard,product_Laptop Computer,product_Mouse
0,0,1,664,664,1,0,0,0,0,0,...,0,0,1,0,0,1,0,0,0,0
1,0,9,649,5841,1,0,0,0,0,0,...,0,0,1,0,0,0,0,1,0,0
2,0,1,821,821,0,0,0,0,0,0,...,0,1,0,0,0,0,0,1,0,0
3,1,24,655,15720,0,0,0,0,0,0,...,0,1,0,0,0,0,1,0,0,0
4,0,1,758,758,0,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,0
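
To make the encoding step concrete, here is a minimal sketch on a hypothetical three-row frame (the values are illustrative, not taken from the dataset). `pd.get_dummies` leaves numeric columns untouched and replaces each text column with one indicator column per distinct value. `dtype=int` is passed only so the result prints as 0/1 on newer pandas versions, which default to booleans.

```python
import pandas as pd

# Hypothetical toy rows that mimic the shape of the real data.
toy = pd.DataFrame(
    {
        "role": ["tech", "non-tech", "tech"],
        "product": ["Desk", "Keyboard", "Desk"],
        "quantity": [1, 9, 2],
    }
)

# Numeric columns pass through; each text column becomes one indicator column per value.
toy_encoded = pd.get_dummies(toy, dtype=int)

print(toy_encoded.columns.tolist())
print(toy_encoded)
```

On the real data, `requester_id` is expanded the same way, one column per requester, which is why the encoded frame above is so wide.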


Checking each feature's correlation with the target

In [12]:
corrs = encoded_data.corr()['tech_approval_required'].abs()

In [13]:
columns = corrs[corrs > 0.10].index

In [14]:
columns

Index(['tech_approval_required', 'role_non-tech', 'role_tech', 'product_Chair',
       'product_Cleaning', 'product_Desk', 'product_Desktop Computer',
       'product_Keyboard', 'product_Laptop Computer', 'product_Mouse'],
      dtype='object')

In [15]:
corrs = corrs.filter(columns)
corrs

tech_approval_required      1.000000
role_non-tech               0.122454
role_tech                   0.122454
product_Chair               0.134168
product_Cleaning            0.191539
product_Desk                0.292137
product_Desktop Computer    0.752144
product_Keyboard            0.242224
product_Laptop Computer     0.516693
product_Mouse               0.190708
Name: tech_approval_required, dtype: float64
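
The identical 0.122454 for `role_non-tech` and `role_tech` is expected: with only two roles, one indicator is the exact complement of the other, so their correlations with the target differ only in sign, and `.abs()` removes the sign. A small sketch on made-up data (the column names `target`, `a`, `not_a` and `noise` are hypothetical) applies the same filter:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "target": [0, 0, 1, 1],
        "a": [0, 0, 1, 1],      # perfectly aligned with the target
        "noise": [1, 0, 1, 0],  # unrelated to the target
    }
)
toy["not_a"] = 1 - toy["a"]  # complement of "a", like role_tech vs role_non-tech

signed = toy.corr()["target"]   # signed correlation of every column with the target
strength = signed.abs()         # magnitude only, as in the notebook
selected = strength[strength > 0.10].index.tolist()

print(signed)
print(selected)
```

The target always passes the filter with a correlation of 1.0 against itself, which is why it stays in the selected columns.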

Keeping only the columns whose absolute correlation with the target exceeds 0.10

In [16]:
encoded_data = encoded_data[columns]
encoded_data.head()

Unnamed: 0,tech_approval_required,role_non-tech,role_tech,product_Chair,product_Cleaning,product_Desk,product_Desktop Computer,product_Keyboard,product_Laptop Computer,product_Mouse
0,0,0,1,0,0,1,0,0,0,0
1,0,0,1,0,0,0,0,1,0,0
2,0,1,0,0,0,0,0,1,0,0
3,1,1,0,0,0,0,1,0,0,0
4,0,1,0,0,0,1,0,0,0,0


Creating Training, Validation and Testing datasets

Splitting Data

In [17]:
train_df, val_and_test_data = train_test_split(encoded_data, test_size=0.3,random_state=0)

val_df, test_df = train_test_split(val_and_test_data,test_size=0.333,random_state=0)
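
As a sanity check on the two-stage split, the sketch below applies the same two calls to a stand-in array of 1,000 row indices (the same number of rows as `df`). The first split holds out 30%, and the second sends roughly a third of that hold-out to the test set, giving an approximate 70/20/10 division.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(1000)  # stand-in for the 1,000 encoded rows

# Stage 1: 70% train, 30% held out.
train_rows, val_and_test_rows = train_test_split(rows, test_size=0.3, random_state=0)

# Stage 2: split the hold-out into validation and test.
val_rows, test_rows = train_test_split(val_and_test_rows, test_size=0.333, random_state=0)

print(len(train_rows), len(val_rows), len(test_rows))
```

Because the three parts come from successive splits of the same rows, no row can appear in more than one of them.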

Converting Data to CSV

In [18]:
train_data = train_df.to_csv(None, header=False, index=False).encode() # No column header; the label is the first column
val_data = val_df.to_csv(None, header=False, index=False).encode() # .encode() turns the CSV text into bytes for the binary write below
test_data = test_df.to_csv(None, header=True, index=False).encode() # Includes the column header
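
The training and validation files are written without a header and with the label in the first column (the training log further down reports `label_column=0`). A minimal sketch with a hypothetical two-row frame shows what the bytes handed to S3 look like:

```python
import pandas as pd

# Hypothetical rows; the label column comes first, as in encoded_data.
toy = pd.DataFrame(
    {
        "tech_approval_required": [0, 1],
        "role_tech": [1, 0],
        "product_Desk": [1, 0],
    }
)

csv_text = toy.to_csv(None, header=False, index=False)  # path None returns a string
payload = csv_text.encode()                             # str -> bytes for a 'wb' file handle

print(type(payload).__name__)
print(csv_text.splitlines())
```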

Saving the CSV files to S3.

In [19]:
with s3.open(f'{data_bucket}/{subfolder}/processed/train.csv', 'wb') as f:
    f.write(train_data)
with s3.open(f'{data_bucket}/{subfolder}/processed/val.csv', 'wb') as f:
    f.write(val_data)
with s3.open(f'{data_bucket}/{subfolder}/processed/test.csv', 'wb') as f:
    f.write(test_data)

Preparing the CSV data for SageMaker

In [21]:
train_input = sagemaker.TrainingInput(s3_data=f's3://{data_bucket}/{subfolder}/processed/train.csv', content_type='csv')

val_input = sagemaker.TrainingInput(s3_data=f's3://{data_bucket}/{subfolder}/processed/val.csv', content_type='csv')

# SageMaker Experiment

SageMaker Experiments is useful because machine learning is an iterative process that involves repeated changes: tuning, modifications, and various preprocessing activities.
An experiment makes it possible to track these changes and the steps that produced each result.

In [22]:
# Importing necessary library

import time
from time import strftime

!pip install sagemaker-experiments 
from sagemaker.analytics import ExperimentAnalytics
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

Collecting sagemaker-experiments
  Downloading sagemaker_experiments-0.1.33-py3-none-any.whl (42 kB)
Installing collected packages: sagemaker-experiments
Successfully installed sagemaker-experiments-0.1.33


In [23]:
# Get the execution role and create a SageMaker session

role = sagemaker.get_execution_role()
sm_sess = sagemaker.session.Session()

In [24]:
create_date = strftime("%Y-%m-%d-%H-%M-%S")
demo_experiment = Experiment.create(experiment_name = "DEMO-{}".format(create_date),
                                    description = "Demo experiment",
                                    tags = [{'Key': 'demo-experiments', 'Value': 'demo1'}])
print(demo_experiment)

Experiment(sagemaker_boto_client=<botocore.client.SageMaker object at 0x7f6364c66240>,experiment_name='DEMO-2021-06-10-22-34-04',description='Demo experiment',tags=[{'Key': 'demo-experiments', 'Value': 'demo1'}],experiment_arn='arn:aws:sagemaker:us-east-2:077107849065:experiment/demo-2021-06-10-22-34-04',response_metadata={'RequestId': '31e0f579-2238-423d-9b80-71b7e2344ab6', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '31e0f579-2238-423d-9b80-71b7e2344ab6', 'content-type': 'application/x-amz-json-1.1', 'content-length': '96', 'date': 'Thu, 10 Jun 2021 22:34:03 GMT'}, 'RetryAttempts': 0})


In [25]:
raw_data_location = f"s3://{data_bucket}/{subfolder}/{dataset}"
train_data_location = f"s3://{data_bucket}/{subfolder}/processed/train.csv"
val_data_location = f"s3://{data_bucket}/{subfolder}/processed/val.csv"
test_data_location = f"s3://{data_bucket}/{subfolder}/processed/test.csv"

In [26]:
# Start Tracking parameters used in the Pre-processing pipeline.

with Tracker.create(display_name="Preprocessing") as tracker:
    tracker.log_parameters({"random_state":0})
    
    # we can log the s3 uri to the dataset we just uploaded
    tracker.log_input(name="ccdefault-raw-dataset", media_type="s3/uri", value=raw_data_location)
    tracker.log_input(name="ccdefault-train-dataset", media_type="s3/uri", value=train_data_location)
    tracker.log_input(name="ccdefault-val-dataset", media_type="s3/uri", value=val_data_location)
    tracker.log_input(name="ccdefault-test-dataset", media_type="s3/uri", value=test_data_location)

# Train the Model

In [27]:
sess = sagemaker.Session()

# SDK v2 replacement for the deprecated get_image_uri
container = sagemaker.image_uris.retrieve('xgboost', boto3.Session().region_name, version='latest')

# Tracking/Monitoring
preprocessing_trial_component = tracker.trial_component

trial_name = "training-job-{}".format(create_date)
demo_trial = Trial.create(trial_name=trial_name, experiment_name=demo_experiment.experiment_name)

demo_trial.add_trial_component(preprocessing_trial_component)
demo_training_job_name = "demo-training-job-{}".format(create_date)


# instance_count / instance_type are the SDK v2 names for train_instance_count / train_instance_type
xgb = sagemaker.estimator.Estimator(container, role, instance_count=1, instance_type='ml.m5.large',
                                    output_path=f's3://{data_bucket}/{subfolder}/output', sagemaker_session=sess)

xgb.set_hyperparameters(
    max_depth=5,
    subsample=0.7,
    objective='binary:logistic',
    eval_metric='auc',
    num_round=100,
    early_stopping_rounds=10,
)

xgb.fit(
    inputs={'train': train_input, 'validation': val_input},
    job_name=demo_training_job_name,
    experiment_config={
        "TrialName": demo_trial.trial_name,  # log training job in Trials for lineage
        "TrialComponentDisplayName": "Training",
    },
    wait=True,
)
time.sleep(2)


See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: demo-training-job-2021-06-10-22-34-04


2021-06-10 22:34:42 Starting - Starting the training job...
2021-06-10 22:35:05 Starting - Launching requested ML instancesProfilerReport-1623364482: InProgress
......
2021-06-10 22:36:05 Starting - Preparing the instances for training......
2021-06-10 22:37:05 Downloading - Downloading input data...
2021-06-10 22:37:40 Training - Training image download completed. Training in progress.
2021-06-10 22:37:40 Uploading - Uploading generated training model
Arguments: train
[2021-06-10:22:37:35:INFO] Running standalone xgboost training.
[2021-06-10:22:37:35:INFO] File size need to be processed in the node: 0.02mb. Available memory size in the node: 85.39mb
[2021-06-10:22:37:35:INFO] Determined delimiter of CSV input is ','
[22:37:35] S3DistributionType set as FullyReplicated
[22:37:35] 700x9 matrix with 6300 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2021-06-10:22:37:35:INFO] Determined delimi

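The training job sets `eval_metric` to `auc`, so the validation score it reports, and that early stopping is expected to watch, is the area under the ROC curve. AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The helper below (`pairwise_auc`, a hypothetical name that is not part of XGBoost or SageMaker) is a minimal sketch of that definition on made-up scores; it compares every pair and is meant only for illustration.

```python
import numpy as np

def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUC needs at least one positive and one negative example")
    diff = pos[:, None] - neg[None, :]  # every positive score minus every negative score
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))

# Made-up labels and scores: three of the four positive/negative pairs are ordered correctly.
example_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example_auc)
```
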
# Hosting the Model

In [29]:
# Importing libraries (SageMaker SDK v2 names)

from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer, JSONSerializer
from sagemaker.deserializers import JSONDeserializer

In [31]:
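from botocore.exceptions import ClientError

endpoint_name = 'order-approval'
try:
    sess.delete_endpoint(endpoint_name)
    print('Warning: Existing endpoint deleted to make way for your new endpoint.')
    time.sleep(30)  # give SageMaker time to finish deleting the endpoint
except ClientError:
    pass  # no endpoint with this name exists yet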

INFO:sagemaker:Deleting endpoint with name: order-approval


In [33]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
import os
import sys
import time
import json
from IPython.display import display
from time import strftime, gmtime
from sagemaker.inputs import TrainingInput
from sagemaker.serializers import CSVSerializer

INFO:matplotlib.font_manager:Generating new fontManager, this may take some time...


In [34]:
xgb_predictor = xgb.deploy(
                            initial_instance_count=1, 
                            instance_type="ml.m4.xlarge", 
                            serializer=CSVSerializer(), 
                            endpoint_name=endpoint_name
                            )

INFO:sagemaker:Creating model with name: xgboost-2021-06-10-22-54-54-456
INFO:sagemaker:Creating endpoint with name order-approval


---------------!

In [38]:
test_df

Unnamed: 0,tech_approval_required,role_non-tech,role_tech,product_Chair,product_Cleaning,product_Desk,product_Desktop Computer,product_Keyboard,product_Laptop Computer,product_Mouse
518,0,1,0,0,0,1,0,0,0,0
773,0,1,0,0,0,0,0,0,0,1
784,0,0,1,0,0,1,0,0,0,0
592,0,1,0,0,0,1,0,0,0,0
181,0,1,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...
860,0,1,0,0,0,0,0,1,0,0
873,0,1,0,1,0,0,0,0,0,0
862,0,1,0,0,1,0,0,0,0,0
999,0,1,0,0,0,0,0,0,0,1


In [39]:
def predict(data, rows=100):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ""
    for array in split_array:
        predictions = ",".join([predictions, xgb_predictor.predict(array).decode("utf-8")])

    return np.fromstring(predictions[1:], sep=",")
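
The batching logic in `predict` can be exercised without a live endpoint. The sketch below swaps the endpoint for `FakePredictor`, a hypothetical stand-in that returns one comma-separated score per row as UTF-8 bytes (an assumption about the response shape, chosen to match how `predict` decodes it). It also parses the joined text with `float` instead of `np.fromstring`.

```python
import numpy as np

class FakePredictor:
    """Hypothetical stand-in for the endpoint: one score per input row, as CSV bytes."""

    def __init__(self):
        self.batch_sizes = []

    def predict(self, array):
        self.batch_sizes.append(array.shape[0])
        scores = array.mean(axis=1)  # placeholder "model", not XGBoost
        return ",".join(str(s) for s in scores).encode("utf-8")

def predict_in_batches(data, predictor, rows=100):
    # Same chunk count as the notebook's predict(): int(n / rows + 1).
    n_chunks = int(data.shape[0] / float(rows) + 1)
    parts = [predictor.predict(chunk).decode("utf-8") for chunk in np.array_split(data, n_chunks)]
    return np.array([float(value) for value in ",".join(parts).split(",")])

fake = FakePredictor()
features = np.ones((250, 9))  # 250 made-up rows with 9 feature columns
scores = predict_in_batches(features, fake)
print(fake.batch_sizes, scores.shape)
```

With the 100-row test set used here, the same formula gives `int(100 / 100 + 1) = 2` chunks, so the endpoint is called twice with 50 rows each.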

In [40]:
predictions = predict(test_df.to_numpy()[:, 1:])

In [41]:
print(predictions)

[0.35572106 0.35572106 0.35572106 0.35572106 0.35572106 0.35572106
 0.35572106 0.35572106 0.35572106 0.35572106 0.35572106 0.6375407
 0.35572106 0.35572106 0.35572106 0.35572106 0.35572106 0.35572106
 0.35572106 0.35572106 0.61951733 0.35572106 0.35572106 0.35572106
 0.61951733 0.6375407  0.61951733 0.35572106 0.35572106 0.35572106
 0.61951733 0.35572106 0.35572106 0.35572106 0.35572106 0.35572106
 0.35572106 0.35572106 0.35572106 0.35572106 0.35572106 0.35572106
 0.35572106 0.35572106 0.35572106 0.35572106 0.35572106 0.35572106
 0.6375407  0.35572106 0.6375407  0.35572106 0.35572106 0.35572106
 0.35572106 0.35572106 0.35572106 0.35572106 0.35572106 0.35572106
 0.35572106 0.35572106 0.6375407  0.61951733 0.35572106 0.35572106
 0.35572106 0.35572106 0.35572106 0.35572106 0.6375407  0.35572106
 0.35572106 0.35572106 0.35572106 0.35572106 0.35572106 0.35572106
 0.61951733 0.35572106 0.35572106 0.35572106 0.35572106 0.35572106
 0.61951733 0.6375407  0.35572106 0.35572106 0.61951733 0.35572

In [42]:
pd.crosstab(
    index=test_df.iloc[:, 0],
    columns=np.round(predictions),
    rownames=["actual"],
    colnames=["predictions"],
)

actual,predicted 0.0,predicted 1.0
0,85,1
1,0,14
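
Accuracy alone can hide how the model treats the minority class, so it helps to read precision and recall off the same table. The arithmetic below uses the four counts shown above, treating class 1 (approval required) as the positive class.

```python
# Counts copied from the crosstab above (rows = actual, columns = predicted).
tn, fp = 85, 1   # actual 0: predicted 0, predicted 1
fn, tp = 0, 14   # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # of the requests flagged for approval, how many needed it
recall = tp / (tp + fn)     # of the requests that needed approval, how many were flagged

print(f"accuracy={accuracy:.2f} precision={precision:.3f} recall={recall:.2f}")
```

On this test split the single error is a false positive: one request was routed to the technical approver unnecessarily, and none that needed approval were missed.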


In [60]:
test_data = test_df

In [83]:
pred = np.round(predictions)
pred = pd.DataFrame(pred)
pred.columns=['Prediction']
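
`np.round` acts here as a 0.5 decision threshold. One subtlety worth knowing: NumPy rounds exact halves to the nearest even number, so a score of exactly 0.5 becomes 0, whereas an explicit `>= 0.5` comparison would label it 1. The sketch below contrasts the two on three scores taken from the printed predictions plus a hypothetical 0.5.

```python
import numpy as np

scores = np.array([0.35572106, 0.61951733, 0.6375407, 0.5])  # last value is hypothetical

rounded = np.round(scores).astype(int)      # round-half-to-even
thresholded = (scores >= 0.5).astype(int)   # explicit cut-off

print(rounded)
print(thresholded)
```

An explicit comparison also makes it easy to try a cut-off other than 0.5 if false positives and false negatives carry different costs.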

In [62]:
test_data = test_data.reset_index(drop=True)

In [85]:
result = pd.merge(pred,test_data, left_index=True, right_index=True)

In [89]:
result = result.astype('int32')

In [90]:
result

Unnamed: 0,Prediction,tech_approval_required,role_non-tech,role_tech,product_Chair,product_Cleaning,product_Desk,product_Desktop Computer,product_Keyboard,product_Laptop Computer,product_Mouse
0,0,0,1,0,0,0,1,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0,1
2,0,0,0,1,0,0,1,0,0,0,0
3,0,0,1,0,0,0,1,0,0,0,0
4,0,0,1,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
95,0,0,1,0,0,0,0,0,1,0,0
96,0,0,1,0,1,0,0,0,0,0,0
97,0,0,1,0,0,1,0,0,0,0,0
98,0,0,1,0,0,0,0,0,0,0,1


In [93]:
from sklearn.metrics import accuracy_score

In [94]:
print("Baseline Accuracy = {}".format(1- np.unique(df['tech_approval_required'], return_counts=True)[1][1]/(len(df['tech_approval_required']))))
print("Accuracy Score = {}".format(accuracy_score(test_df['tech_approval_required'], result['Prediction'])))

Baseline Accuracy = 0.8069999999999999
Accuracy Score = 0.99
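
The baseline above is the accuracy of always predicting the majority class (no approval required). A more readable way to compute it, sketched here on a label series rebuilt from the class counts reported earlier (807 zeros and 193 ones), is `value_counts(normalize=True).max()`. Note that the notebook's baseline uses the full dataset while the accuracy score uses the 100-row test split, whose actual classes total 86 and 14 in the crosstab.

```python
import pandas as pd

# Labels rebuilt from the class counts shown earlier (807 zeros, 193 ones).
labels = pd.Series([0] * 807 + [1] * 193, name="tech_approval_required")
baseline_accuracy = labels.value_counts(normalize=True).max()

# Actual-class totals of the test split, read from the crosstab rows (85 + 1 and 0 + 14).
test_labels = pd.Series([0] * 86 + [1] * 14)
test_baseline = test_labels.value_counts(normalize=True).max()

print(f"Baseline (all rows) = {baseline_accuracy:.3f}")
print(f"Baseline (test split) = {test_baseline:.2f}")
```

Comparing the model against the baseline of the same split gives the fairer picture of how much it adds over the majority-class guess.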
