# Customer Malware Detection with XGBoost
_**Using Gradient Boosted Trees to Predict Malware Detections on Windows Machines**_

---

## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Train](#Train)
1. [Compile](#Compile)
1. [Host](#Host)
   1. [Evaluate](#Evaluate)
   1. [Relative cost of errors](#Relative-cost-of-errors)
1. [Extensions](#Extensions)

---

## Background

_This notebook has been adapted from an [AWS blog post](https://aws.amazon.com/blogs/ai/predicting-customer-churn-with-amazon-machine-learning/)_

---

## Setup

_This notebook was created and tested on an ml.m4.xlarge notebook instance._

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as the Notebook Instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these.  Note: if more than one role is required for notebook instances, training, and/or hosting, replace the `get_execution_role()` call with the appropriate full IAM role ARN string(s).

In [1]:
import sagemaker
sess = sagemaker.Session()
bucket='ml2-assignment-sagemaker-swamy'
prefix = 'xgboost'

# Define IAM role
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()

Next, we'll import the Python libraries we'll need for the remainder of the exercise.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
import os
import sys
import time
import json
from IPython.display import display
from time import strftime, gmtime
from sagemaker.inputs import TrainingInput
from sagemaker.serializers import CSVSerializer

---
## Data

The data has already been cleaned and preprocessed, and the resulting files are stored in S3.
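The training log in the Train section reports `label_column=0` and a comma delimiter, which matches the layout SageMaker's built-in XGBoost expects for CSV input: label in the first column, no header row. The following is a minimal sketch of writing a frame in that layout. The column names (`HasDetections`, `feature_a`, `feature_b`) and the tiny frame are hypothetical stand-ins, not the actual preprocessed dataset.

```python
import io
import pandas as pd

# Hypothetical toy frame; the real preprocessed data is not shown in this notebook.
df = pd.DataFrame({
    "feature_a": [0.5, 1.2, 3.3],
    "feature_b": [1, 0, 1],
    "HasDetections": [1, 0, 1],  # assumed name for the label column
})

label = "HasDetections"
# Put the label first, then the remaining feature columns in their existing order.
ordered = df[[label] + [c for c in df.columns if c != label]]

# Write without a header row and without the index column.
buffer = io.StringIO()
ordered.to_csv(buffer, header=False, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

In practice the same `to_csv` call would write to a local file that is then uploaded to the bucket and prefix defined in Setup.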

---
## Train

Moving on to training, we first need to look up the location of the XGBoost algorithm container.

In [3]:
container = sagemaker.image_uris.retrieve('xgboost', boto3.Session().region_name, '1')
display(container)

'811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:1'

Then, because we're training with the CSV file format, we'll create `TrainingInput`s that our training function can use as a pointer to the files in S3.

In [4]:
s3_input_train = TrainingInput(s3_data='s3://{}/{}/train.csv'.format(bucket, prefix), content_type='csv')
s3_input_validation = TrainingInput(s3_data='s3://{}/{}/test.csv'.format(bucket, prefix), content_type='csv')

Now, we can specify a few parameters like what type of training instances we'd like to use and how many, as well as our XGBoost hyperparameters.  A few key hyperparameters are:
- `max_depth` controls how deep each tree within the algorithm can be built.  Deeper trees can lead to better fit, but are more computationally expensive and can lead to overfitting.  There is typically some trade-off in model performance that needs to be explored between a large number of shallow trees and a smaller number of deeper trees.
- `subsample` controls sampling of the training data.  This technique can help reduce overfitting, but setting it too low can also starve the model of data.
- `num_round` controls the number of boosting rounds.  Each round trains an additional tree on the residuals left by the previous rounds.  Again, more rounds should produce a better fit on the training data, but can be computationally expensive or lead to overfitting.
- `eta` is the step size (learning rate) applied to each round of boosting.  Smaller values lead to more conservative boosting.
- `gamma` controls how aggressively trees are grown.  Larger values lead to more conservative models.
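To make the interplay between `eta` and `num_round` concrete, here is a toy sketch, not XGBoost itself. Each round fits a deliberately trivial weak learner (the mean of the current residuals) and adds it to the running prediction scaled by `eta`. The function name `boost_constant_learner` and the data are hypothetical, chosen only for illustration.

```python
import numpy as np

def boost_constant_learner(y, eta, num_round):
    """Toy boosting loop: each round fits the mean of the current residuals."""
    prediction = np.zeros_like(y, dtype=float)
    for _ in range(num_round):
        residual = y - prediction          # what the model still gets wrong
        weak_learner = residual.mean()     # trivial stand-in for a fitted tree
        prediction += eta * weak_learner   # eta scales the size of each update
    return prediction

def remaining_error(y, eta, num_round):
    """Mean absolute residual left after boosting."""
    return float(np.abs(y - boost_constant_learner(y, eta, num_round)).mean())

y = np.array([1.0, 1.0, 1.0, 1.0])
for eta in (0.1, 0.36, 1.0):
    print(eta, remaining_error(y, eta=eta, num_round=5))
```

In this toy setting the remaining error after `n` rounds is `(1 - eta) ** n`, so a smaller step size needs more rounds to reach the same training fit. That is the trade-off to keep in mind when tuning the two parameters together.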

More detail on XGBoost's hyperparameters can be found on their GitHub [page](https://github.com/dmlc/xgboost/blob/master/doc/parameter.md).

In [5]:
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    instance_count=1, 
                                    instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)
xgb.set_hyperparameters(max_depth=8,
                        eta=0.36,
                        gamma=6.5,
                        min_child_weight=14,
                        subsample=0.85,
                        silent=0,
                        objective='binary:logistic',
                        num_round=800,
                        colsample_bytree=0.15,
                        eval_metric='auc')

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}) 

2021-03-08 04:37:17 Starting - Starting the training job...
2021-03-08 04:37:40 Starting - Launching requested ML instances
ProfilerReport-1615178236: InProgress
.........
2021-03-08 04:39:01 Starting - Preparing the instances for training......
2021-03-08 04:40:02 Downloading - Downloading input data...
2021-03-08 04:40:43 Training - Training image download completed. Training in progress..
Arguments: train
[2021-03-08:04:40:43:INFO] Running standalone xgboost training.
[2021-03-08:04:40:43:INFO] File size need to be processed in the node: 526.96mb. Available memory size in the node: 8426.34mb
[2021-03-08:04:40:43:INFO] Determined delimiter of CSV input is ','
[04:40:43] S3DistributionType set as FullyReplicated
[04:40:46] 1022809x124 matrix with 126828316 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2021-03-08:04:40:46:INFO] Determined delimiter of CSV input is ','
[04:40:46] S3Di

---
## Host

Now that we've trained the algorithm, let's create a model and deploy it to a hosted endpoint.

In [6]:
xgb_predictor = xgb.deploy(
    initial_instance_count = 1, 
    instance_type = 'ml.m4.xlarge',
    serializer=CSVSerializer())

-----------------!

### Evaluate

Now that we have a hosted endpoint running, we can make real-time predictions from our model by sending an HTTP POST request.  The `CSVSerializer` passed to `deploy` converts the NumPy arrays we send into CSV payloads for the model behind the endpoint; we still need to load `test_data` and parse the CSV responses ourselves.

Now, we'll use a simple function to:
1. Loop over our test dataset
1. Split it into mini-batches of rows 
1. Convert those mini-batches to CSV string payloads
1. Retrieve mini-batch predictions by invoking the XGBoost endpoint
1. Collect predictions and convert from the CSV output our model provides into a NumPy array
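The steps above can be tried offline before calling a real endpoint. The sketch below mirrors the batching and CSV round trip with a stub in place of the endpoint. `fake_endpoint`, `to_csv_payload`, and `predict_in_batches` are hypothetical helpers written for this illustration, and the stub returns a constant score rather than real model output.

```python
import numpy as np

def fake_endpoint(payload):
    """Hypothetical stand-in for the endpoint: one score per CSV row, returned as bytes."""
    n_rows = len(payload.strip().split("\n"))
    return ",".join(["0.5"] * n_rows).encode("utf-8")

def to_csv_payload(batch):
    """Serialize a 2-D array into CSV text, one row per line."""
    return "\n".join(",".join(str(v) for v in row) for row in batch)

def predict_in_batches(data, rows=500):
    """Split data into mini-batches, send each one, and collect the scores."""
    n_batches = int(data.shape[0] / float(rows) + 1)
    scores = []
    for batch in np.array_split(data, n_batches):
        if batch.shape[0] == 0:
            continue  # nothing to send for an empty split
        response = fake_endpoint(to_csv_payload(batch)).decode("utf-8")
        scores.extend(float(s) for s in response.split(","))
    return np.array(scores)

features = np.random.rand(1200, 4)
preds = predict_in_batches(features)
print(preds.shape)
```

With 1200 rows and `rows=500`, the batch count is `int(1200 / 500 + 1)`, which is 3, so `np.array_split` produces three batches of 400 rows each.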

In [10]:
import pandas as pd
import boto3

bucket = "ml2-assignment-sagemaker-swamy"
file_name = "test.csv"

# Create an S3 client using the default credentials and configuration
s3 = boto3.client('s3')

# Fetch the object stored under the key `file_name` in the bucket
obj = s3.get_object(Bucket=bucket, Key=file_name)

# 'Body' holds the streaming content of the S3 response
test_data = pd.read_csv(obj['Body'])

In [11]:
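def predict(data, rows=500):
    # Split the data into mini-batches of roughly `rows` rows each
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        # Each response is a CSV string of scores; append it to the running string
        predictions = ','.join([predictions, xgb_predictor.predict(array).decode('utf-8')])

    # Drop the leading comma and parse the scores into a NumPy array
    return np.fromstring(predictions[1:], sep=',')

# Column 0 holds the label, so only the feature columns are sent to the endpoint
predictions = predict(test_data.to_numpy()[:,1:])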

In [13]:
# View the predictions
predictions

array([0.35684675, 0.27612132, 0.07477946, ..., 0.65090072, 0.48381004,
       0.74736536])

In [14]:
# construct confusion matrix for the predictions
pd.crosstab(index=test_data.iloc[:, 0], columns=np.round(predictions), rownames=['actual'], colnames=['predictions'])

| actual (rows) / predictions (columns) | 0.0 | 1.0 |
| --- | --- | --- |
| 0 | 157150 | 64714 |
| 1 | 79350 | 137133 |

In [15]:
# identify TP, TN, FP, FN from the confusion matrix
TP = 137133  # true positives: actual 1, predicted 1
TN = 157150  # true negatives: actual 0, predicted 0
FP = 64714   # false positives: actual 0, predicted 1
FN = 79350   # false negatives: actual 1, predicted 0
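The four counts can also be read from the crosstab programmatically rather than typed in by hand. The sketch below uses small made-up arrays in place of `test_data` and `predictions`; the `reindex` call is an added safeguard so that both classes appear even if the model never predicts one of them.

```python
import numpy as np
import pandas as pd

# Made-up labels and scores, standing in for the real test set and endpoint output.
actual = np.array([0, 0, 1, 1, 1, 0, 1, 0])
scores = np.array([0.1, 0.7, 0.8, 0.4, 0.9, 0.2, 0.6, 0.3])
predicted = np.round(scores).astype(int)

confusion = pd.crosstab(index=actual, columns=predicted,
                        rownames=['actual'], colnames=['predictions'])
# Guarantee a full 2x2 table with rows and columns ordered 0, 1.
confusion = confusion.reindex(index=[0, 1], columns=[0, 1], fill_value=0)

TP = int(confusion.loc[1, 1])  # actual 1, predicted 1
TN = int(confusion.loc[0, 0])  # actual 0, predicted 0
FP = int(confusion.loc[0, 1])  # actual 0, predicted 1
FN = int(confusion.loc[1, 0])  # actual 1, predicted 0
print(TP, TN, FP, FN)
```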

### Calculation of Accuracy and F1 Score

### Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN)

In [16]:
(137133+157150)/(137133 + 157150 + 64714 + 79350)

0.6713471291009178

### F1 Score
- Recall = TP / (TP + FN)
- Precision = TP / (TP + FP)
- F1 = 2 * (Recall * Precision) / (Recall + Precision)

In [17]:
# Recall
137133/(137133+79350)

0.6334585163731101

In [18]:
# Precision
137133/(137133+64714)

0.6793908257244349

In [19]:
# F1 Score 
2*((0.6334585163731101*0.6793908257244349)/(0.6334585163731101+0.6793908257244349))

0.6556211603279708
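The four calculations above can be gathered into one helper so the metrics are derived from the counts instead of retyped values. This is a sketch; the function name `classification_metrics` is hypothetical, and the counts are the ones from the confusion matrix in this notebook.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, recall, precision and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * (recall * precision) / (recall + precision)
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}

metrics = classification_metrics(tp=137133, tn=157150, fp=64714, fn=79350)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The helper assumes at least one actual positive and one predicted positive; otherwise the recall or precision denominator would be zero.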