# Using IP Insights to score security findings
-------
Amazon SageMaker IP Insights is an unsupervised anomaly detection algorithm for suspicious IP addresses. It uses statistical modeling and neural networks to capture associations between online resources (such as account IDs or hostnames) and IPv4 addresses. Under the hood, it learns vector representations for online resources and IP addresses.
  
As a result, if the vectors representing an IP address and an online resource are close together, it is likely (not surprising) for that IP address to access that online resource, even if it has never accessed it before.
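
To make the geometric intuition concrete, here is a minimal sketch using hand-picked toy vectors. The vectors, their three-dimensional size, and the names `account_vec`, `usual_ip_vec`, and `unusual_ip_vec` are illustrative assumptions, not values learned by IP Insights. The sketch only mirrors the idea behind the `dot_product` score used later in this notebook: vectors that point in a similar direction produce a higher value.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy embeddings chosen by hand for illustration only (not learned by a model).
account_vec = [0.9, 0.1, 0.3]
usual_ip_vec = [0.8, 0.2, 0.4]       # points in a similar direction to the account
unusual_ip_vec = [-0.7, 0.5, -0.2]   # points in a different direction

usual_score = dot(account_vec, usual_ip_vec)
unusual_score = dot(account_vec, unusual_ip_vec)

print("usual pairing score:  ", round(usual_score, 2))
print("unusual pairing score:", round(unusual_score, 2))
```

The pairing whose vectors are aligned gets the higher score, which is the sense in which "close together" means "not surprising".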

In this notebook, we use the Amazon SageMaker IP Insights algorithm to train a model on the `<principal ID, IP address>` tuples we generated from the CloudTrail log data. We then use the model to perform inference on the same type of tuples generated from GuardDuty findings, to determine how unusual it is to see a particular IP address for the principal involved in a finding.
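
The preprocessing that produced those tuples is not shown in this notebook. As a rough sketch of what it could look like, the hypothetical helper `extract_tuple` below reads the `userIdentity.principalId` and `sourceIPAddress` fields of a CloudTrail record and keeps the record only if the address parses as IPv4. The sample records are made up for illustration and use documentation IP ranges.

```python
import ipaddress

def extract_tuple(record):
    """Return (principal_id, ipv4_address) for one CloudTrail record, or None if unusable."""
    principal_id = record.get("userIdentity", {}).get("principalId")
    source_ip = record.get("sourceIPAddress")
    if not principal_id or not source_ip:
        return None
    try:
        # sourceIPAddress may hold a service name or an IPv6 address; keep IPv4 only.
        ipaddress.IPv4Address(source_ip)
    except ipaddress.AddressValueError:
        return None
    return principal_id, source_ip

# Made-up records for illustration.
sample_records = [
    {"userIdentity": {"principalId": "AIDAEXAMPLE1"}, "sourceIPAddress": "192.0.2.10"},
    {"userIdentity": {"principalId": "AIDAEXAMPLE2"}, "sourceIPAddress": "ec2.amazonaws.com"},
    {"userIdentity": {}, "sourceIPAddress": "198.51.100.7"},
]

tuples = [t for t in map(extract_tuple, sample_records) if t is not None]
print(tuples)
```

Only the first record survives: the second carries a service name instead of an IP address, and the third has no principal ID.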

After running this notebook, you should be able to:

- obtain, transform, and store data for use in Amazon SageMaker,
- create an Amazon SageMaker training job to produce an IP Insights model,
- use the model to perform inference with an Amazon SageMaker endpoint.

If you would like to know more, please check out the [SageMaker IP Insights Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/ip-insights.html).

## Setup
------
*This notebook was created and tested on an ml.m4.xlarge notebook instance. We recommend using the same, but other instance types should still work.*

The following is a cell that contains Python code.  It can be run in two ways:  
1. Selecting the cell (click anywhere inside it), and then clicking the button above labelled "Run".  
2. Selecting the cell (click anywhere inside it), and typing Shift+Return on your keyboard.  

When a cell is running, you will see a star (\*) in the brackets to the left (e.g., `In [*]`), and when it has completed you will see a number in the brackets. Each click of "Run" will execute the next cell in the notebook.

Go ahead and click **Run** now. You should see the text in the `print` statement get printed just beneath the cell.

All of these cells share the same interpreter, so if a cell imports modules, like this one does, those modules will be available to every subsequent cell.

In [None]:
import boto3
import botocore
import os
import sagemaker

print("Welcome to IP Insights!")

### ACTION: Configure Amazon S3 Bucket

Before going further, we need to specify the S3 bucket that SageMaker will use for the model's input and output data. This is the bucket that holds our training tuples (from CloudTrail logs) and our inference tuples (from GuardDuty findings). Edit the following cell to specify the name of the bucket and then run it; you do not need to change the prefix.

In [None]:
bucket = 'aws-reinvent2019-secml-builder-sessions'
prefix = ''

Finally, run the next cell to complete the setup.

In [None]:
execution_role = sagemaker.get_execution_role()

# Check if the bucket exists
try:
    boto3.Session().client('s3').head_bucket(Bucket=bucket)
except botocore.exceptions.ParamValidationError as e:
    print('Hey! You either forgot to specify your S3 bucket'
          ' or you gave your bucket an invalid name!')
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == '403':
        print("Hey! You don't have permission to access the bucket, {}.".format(bucket))
    elif e.response['Error']['Code'] == '404':
        print("Hey! Your bucket, {}, doesn't exist!".format(bucket))
    else:
        raise
else:
    print('Training input/output will be stored in: s3://{}/{}'.format(bucket, prefix))

## Training

Execute the two cells below to start training. Training should take several minutes to complete, and some logging information will output to the display. (These logs are also available in CloudWatch.) You can look at various training metrics in the log as the model trains.  
When training is complete, you will see log output like this:  
>`2019-02-11 20:34:41 Completed - Training job completed`  
>`Billable seconds: 71`

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri

image = get_image_uri(boto3.Session().region_name, 'ipinsights')


# Configure SageMaker IP Insights input channels
train_key = os.path.join(prefix, 'train', 'cloudtrail_tuples.csv')
s3_train_data = 's3://{}/{}'.format(bucket, train_key)

input_data = {
    'train': sagemaker.session.s3_input(s3_train_data, distribution='FullyReplicated', content_type='text/csv')
}

In [None]:
# Set up the estimator with training job configuration
ip_insights = sagemaker.estimator.Estimator(
    image, 
    execution_role, 
    train_instance_count=1, 
    train_instance_type='ml.m4.xlarge',
    # os.path.join avoids a double slash in the S3 URI when prefix is empty
    output_path='s3://{}/{}'.format(bucket, os.path.join(prefix, 'output')),
    sagemaker_session=sagemaker.Session())

# Configure algorithm-specific hyperparameters
ip_insights.set_hyperparameters(
    num_entity_vectors='20000',
    random_negative_sampling_rate='5',
    vector_dim='128', 
    mini_batch_size='1000',
    epochs='5',
    learning_rate='0.01',
)

# Start the training job (should take 3-4 minutes to complete)  
ip_insights.fit(input_data)

In [None]:
print('Training job name: {}'.format(ip_insights.latest_training_job.job_name))


## Now Deploy Model
Execute the cell below to deploy the trained model on an endpoint for inference. It should take 5-7 minutes to spin up the instance and deploy the model (the horizontal dashed line represents progress, and it will print an exclamation point \[!\] when it is complete).

In [None]:
# NOW DEPLOY MODEL
predictor = ip_insights.deploy(
    initial_instance_count=1,
    instance_type='ml.m4.xlarge'
)

In [None]:
# SHOW ENDPOINT NAME
print('Endpoint name: {}'.format(predictor.endpoint))

## Inference
Now that we have trained the model on known data, we can pass new data to it to generate scores.  We want to see if our new data looks normal, or anomalous.  
We can pass data in a variety of formats to our inference endpoint. In this example, we will pass CSV-formatted data.

In [None]:
from sagemaker.predictor import csv_serializer, json_deserializer

predictor.content_type = 'text/csv'
predictor.serializer = csv_serializer
predictor.accept = 'application/json'
predictor.deserializer = json_deserializer

When queried with a principal and an IP address, the model returns a score (called 'dot_product') which indicates how expected that event is. In other words, *the higher the dot_product, the more normal the event is.*  
Let's first run inference on the training (normal) data as a sanity check.

In [None]:
import pandas as pd

# Run inference on training (normal) data for sanity check
s3_infer_data = 's3://{}/{}'.format(bucket, train_key)
inference_data = pd.read_csv(s3_infer_data)
print(inference_data.head())
train_dot_products = predictor.predict(inference_data.values)

In [None]:
# Prepare for plotting by collecting just the dot products
train_plot_data = [x['dot_product'] for x in train_dot_products['predictions']]
train_plot_data[:10]

In [None]:
# Plot the training data inference values as a histogram
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

n, bins, patches = plt.hist(train_plot_data, 10, facecolor='blue')
plt.xlabel('IP Insights Score')
plt.ylabel('Frequency')
plt.show()


Notice that (almost) all the values above are greater than zero.
Now let's run inference on the GuardDuty findings. Since they come from GuardDuty alerts, we expect them to be generally more anomalous, so we would expect to see lower scores.

In [None]:
# Run inference on GuardDuty findings
infer_key = os.path.join(prefix, 'infer', 'guardduty_tuples.csv')
s3_infer_data = 's3://{}/{}'.format(bucket, infer_key)
inference_data = pd.read_csv(s3_infer_data)
print(inference_data.head())
GuardDuty_dot_products = predictor.predict(inference_data.values)

In [None]:
# Prepare GuardDuty data for plotting by collecting just the dot products
GuardDuty_plot_data = [x['dot_product'] for x in GuardDuty_dot_products['predictions']]
GuardDuty_plot_data[:10]

In [None]:
# Plot both the training data and the GuardDuty data together so we can compare

nG, binsG, patchesG = plt.hist(GuardDuty_plot_data, 10, facecolor='red')
nT, binsT, patchesT = plt.hist(train_plot_data, 10, facecolor='blue')

plt.legend(["GuardDuty", "Training"])
plt.xlabel('IP Insights Score')
plt.ylabel('Frequency')
plt.show()

Aha! While the GuardDuty sample is small, we can see that these scores are generally lower than the scores for normal (training) data.  (Due to randomness in model training, the precise dot product values and ranges will vary between models.)
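
The same comparison can be made numerically instead of by eye. The sketch below uses a hypothetical helper, `summarize`, on two short lists of made-up scores; in this notebook you would pass `train_plot_data` and `GuardDuty_plot_data` instead. The values are illustrative only and are not results from the workshop data.

```python
import statistics

def summarize(scores):
    """Return a few summary statistics for a list of dot_product scores."""
    return {
        "min": min(scores),
        "median": statistics.median(scores),
        "max": max(scores),
        "share_below_zero": sum(s < 0 for s in scores) / len(scores),
    }

# Made-up scores for illustration only.
training_scores = [4.1, 3.7, 5.2, 2.9, 0.4, -0.3, 3.3, 4.8]
guardduty_scores = [-2.1, 0.6, -1.4, 1.2, -3.0]

training_summary = summarize(training_scores)
guardduty_summary = summarize(guardduty_scores)

print("training: ", training_summary)
print("GuardDuty:", guardduty_summary)
```

A lower median and a larger share of negative scores in the GuardDuty set would support what the histogram suggests.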

## Choosing a Scoring Threshold  
It is reasonable to ask, "What's the cutoff?" Which inference scores should we label "anomalous" and which should we label "ok"? We know the lowest scores are the most likely anomalous candidates, but what about the others?  
There is no universal threshold above which is "normal/ok" and below which is "anomalous".  Each domain and data set scores differently.  An acceptable threshold (say, 0.0) for one data set may not be appropriate for another.  

So, what do we do? A common approach is to train with two types of data: data that is known "normal" and data that is known malicious.  If we have both, we can compare the scores of the two to find a good cut-off.  (Of course, known malicious data can be hard to come by.  In the full IP Insights tutorial, you can learn about a simple method for simulating malicious web traffic.)  

A good way to see the comparison between the scores of two data sets is to plot the two distributions - both normal and anomalous - and see how they interact.  
  
The results of this are much easier to see with larger data sets. (We used smaller data sets above to keep the computation running times down.)

**Your workshop instructor will show you example graphs drawn from larger data sets to make this concept clearer.**

With a larger data set, it is easier to see the separation between normal traffic and suspicious traffic. Because lower scores are more anomalous, we flag an event when its score falls below the threshold, and we could select that threshold depending on the application:

- For example, if we were working with low-impact decisions (such as whether to ask for another authentication factor during login), we could use a higher threshold = *<<insert appropriate value>>*. This would flag more events, catching more true positives at the cost of more false positives.
- On the other hand, if our decision system were more sensitive to false positives (e.g., devoting expert analyst time to investigating suspicious activity), we could choose a lower threshold, such as threshold = *<<insert another appropriate value>>*. That way, if we were sending the flagged cases to manual investigation, we would have higher confidence that the activity was suspicious.
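
The trade-off can be explored with a small threshold sweep. The sketch below assumes this notebook's convention that lower scores are more anomalous, so an event is flagged when its score is below the threshold. The helper `rates_at_threshold` and both score lists are hypothetical and made up for illustration; with real data you would use scores from known-normal and known-suspicious traffic.

```python
def rates_at_threshold(normal_scores, suspicious_scores, threshold):
    """Return (true_positive_rate, false_positive_rate) when flagging scores below threshold."""
    tpr = sum(s < threshold for s in suspicious_scores) / len(suspicious_scores)
    fpr = sum(s < threshold for s in normal_scores) / len(normal_scores)
    return tpr, fpr

# Made-up scores for illustration only.
normal_scores = [4.1, 3.7, 5.2, 2.9, 0.4, -0.3, 3.3, 4.8]
suspicious_scores = [-2.1, 0.6, -1.4, 1.2, -3.0]

results = {}
for threshold in (-1.0, 0.0, 1.0, 2.0):
    results[threshold] = rates_at_threshold(normal_scores, suspicious_scores, threshold)
    tpr, fpr = results[threshold]
    print("threshold={:5.1f}  true-positive rate={:.2f}  false-positive rate={:.3f}".format(
        threshold, tpr, fpr))
```

On these toy values, moving the threshold up flags more of the suspicious events but also more of the normal ones, which is the trade-off to weigh against the cost of each kind of error.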

## How would we put all this together?
So far, you have built an endpoint and done some analysis to determine a scoring threshold for deciding which IP addresses should be tagged anomalous.
One way to build this into a live detection stream will be shown and discussed towards the end of the lab...
<ol>
<li>Use CloudWatch Events to capture specific types of GuardDuty alerts of interest.</li>
<li>Pass the information about that GuardDuty alert to a special Lambda function you will write.</li>
<li>Lambda:
    <ol>
        <li>Takes a GuardDuty alert and finds the originating IP address in the JSON.</li>
        <li>Sends that IP address to the inference endpoint you just built with your model, and gets back a score.</li>
        <li>Compares that score to the threshold you determined for anomalous.</li>
        <li>Sends appropriate alerts (SNS, CWE, other) if the IP address is sufficiently anomalous.</li>
    </ol>
</li>
</ol>
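
The Lambda steps above could be sketched roughly as follows. This is not the function used in the lab; it is a minimal illustration under several assumptions: the finding arrives under the event's `detail` key, the remote IP is at `service.action.<action type>.remoteIpDetails.ipAddressV4`, the principal is at `resource.accessKeyDetails.principalId`, and the environment variables `ENDPOINT_NAME`, `SCORE_THRESHOLD`, and `ALERT_TOPIC_ARN` are hypothetical names you would configure yourself. Because the model scores a principal and an IP address together, the sketch sends both to the endpoint.

```python
import json
import os

def extract_principal_and_ip(event):
    """Pull (principal ID, remote IPv4) out of a GuardDuty finding event."""
    detail = event.get("detail", {})
    principal_id = (detail.get("resource", {})
                          .get("accessKeyDetails", {})
                          .get("principalId"))
    action = detail.get("service", {}).get("action", {})
    # The remote IP sits under a different key depending on the action type.
    for key in ("awsApiCallAction", "networkConnectionAction"):
        ip = action.get(key, {}).get("remoteIpDetails", {}).get("ipAddressV4")
        if ip:
            return principal_id, ip
    return principal_id, None

def parse_score(response_body):
    """Read the first dot_product from the endpoint's JSON response."""
    return json.loads(response_body)["predictions"][0]["dot_product"]

def lambda_handler(event, context):
    import boto3  # imported here so the helpers above can be tested without AWS access

    principal_id, ip = extract_principal_and_ip(event)
    if not principal_id or not ip:
        return {"scored": False}

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=os.environ["ENDPOINT_NAME"],  # hypothetical variable name
        ContentType="text/csv",
        Accept="application/json",
        Body="{},{}".format(principal_id, ip),
    )
    score = parse_score(response["Body"].read())

    # Lower scores are more anomalous, so alert when the score is below the threshold.
    anomalous = score < float(os.environ["SCORE_THRESHOLD"])
    if anomalous:
        boto3.client("sns").publish(
            TopicArn=os.environ["ALERT_TOPIC_ARN"],  # hypothetical variable name
            Message=json.dumps({"principalId": principal_id, "ip": ip, "score": score}),
        )
    return {"scored": True, "score": score, "anomalous": anomalous}
```

Keeping the parsing helpers separate from the AWS calls makes them easy to check against sample findings before the function is deployed.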

In [None]:
# Uncomment to delete the endpoint (i.e., to avoid any recurring charges)
#sagemaker.Session().delete_endpoint(predictor.endpoint)