# End-to-End NLP: News Headline Classifier (SageMaker Version)

This notebook trains a Keras-based model to classify news headlines into four categories: Business (b), Entertainment (e), Health & Medicine (m) and Science & Technology (t).

Following on from the previous local-mode notebook, we show how to trigger the model training and deployment on separate infrastructure - to make better use of resources.


### Set Up Execution Role and Session

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting. If you don't specify a bucket, the SageMaker SDK will create a default bucket in the same region, following a pre-defined naming convention.
- The IAM role ARN used to give SageMaker access to your data. It can be fetched using the **get_execution_role** method from the SageMaker Python SDK.

In [1]:
%%time
%load_ext autoreload
%autoreload 2

import sagemaker
from sagemaker import get_execution_role

role = get_execution_role()
print(role)
sess = sagemaker.Session()


arn:aws:iam::344028372807:role/service-role/AmazonSageMaker-ExecutionRole-20190212T154595
CPU times: user 903 ms, sys: 80.8 ms, total: 984 ms
Wall time: 10.1 s


### Download News Aggregator Dataset

The News Aggregator Dataset is available from the public **UCI Machine Learning Repository** (these files should already have been downloaded in the previous notebook).


In [2]:
import util.preprocessing

util.preprocessing.download_dataset()


Using TensorFlow backend.



Downloading data...
Saved to data/ folder


### Let's visualize the dataset

We will load the `newsCorpora.csv` file into a Pandas DataFrame for our data processing work.

In [3]:
import os
import re
import numpy as np
import pandas as pd


In [4]:
column_names = ["TITLE", "URL", "PUBLISHER", "CATEGORY", "STORY", "HOSTNAME", "TIMESTAMP"]
df = pd.read_csv("data/newsCorpora.csv", names=column_names, header=None, delimiter="\t")
df.head()


| | TITLE | URL | PUBLISHER | CATEGORY | STORY | HOSTNAME | TIMESTAMP |
|---|---|---|---|---|---|---|---|
| 1 | Fed official says weak data caused by weather,... | http://www.latimes.com/business/money/la-fi-mo... | Los Angeles Times | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.latimes.com | 1394470370698 |
| 2 | Fed's Charles Plosser sees high bar for change... | http://www.livemint.com/Politics/H2EvwJSK2VE6O... | Livemint | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.livemint.com | 1394470371207 |
| 3 | US open: Stocks fall after Fed official hints ... | http://www.ifamagazine.com/news/us-open-stocks... | IFA Magazine | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com | 1394470371550 |
| 4 | Fed risks falling 'behind the curve', Charles ... | http://www.ifamagazine.com/news/fed-risks-fall... | IFA Magazine | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com | 1394470371793 |
| 5 | Fed's Plosser: Nasty Weather Has Curbed Job Gr... | http://www.moneynews.com/Economy/federal-reser... | Moneynews | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.moneynews.com | 1394470372027 |


For this exercise we'll **only use**:

- The **title** (Headline) of the news story, as our input
- The **category**, as our target variable


In [5]:
df["CATEGORY"].value_counts()


e    152469
b    115967
t    108344
m     45639
Name: CATEGORY, dtype: int64

The dataset has four article categories: Business (b), Entertainment (e), Health & Medicine (m) and Science & Technology (t).


## Natural Language Pre-Processing

We'll do some basic processing of the text data to convert it into numerical form that the algorithm will be able to consume to create a model.

We will apply pre-processing steps that are typical for NLP workloads: dummy encoding the labels, tokenizing the documents, and padding the documents to a fixed sequence length so that every input vector has the same dimension.


### Dummy Encode the Labels


In [6]:
encoded_y, labels = util.preprocessing.dummy_encode_labels(df, "CATEGORY")
print(labels)


['b' 'e' 'm' 't']


In [7]:
df["CATEGORY"][1]

'b'

In [8]:
encoded_y[0]

array([1., 0., 0., 0.], dtype=float32)
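
The implementation of `util.preprocessing.dummy_encode_labels` is not shown in this notebook. As a rough illustration of what dummy (one-hot) encoding does, here is a minimal pure-Python sketch; the function name `dummy_encode` is an assumption made for this example only.

```python
def dummy_encode(values):
    """One-hot encode a list of labels; returns (encoded_rows, sorted_labels)."""
    labels = sorted(set(values))                       # e.g. ['b', 'e', 'm', 't']
    index = {label: i for i, label in enumerate(labels)}
    encoded = []
    for value in values:
        row = [0.0] * len(labels)                      # one slot per class
        row[index[value]] = 1.0                        # mark the slot of this label
        encoded.append(row)
    return encoded, labels


encoded, labels = dummy_encode(["b", "t", "e", "m", "b"])
print(labels)      # ['b', 'e', 'm', 't']
print(encoded[0])  # [1.0, 0.0, 0.0, 0.0]
```

Each row has exactly one non-zero entry, at the position of its class in the sorted label list, which matches the shape of the `encoded_y[0]` output above.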

### Tokenize and Set Fixed Sequence Lengths

We want to describe our inputs at the more meaningful word level (rather than individual characters), and ensure a fixed length of the input feature dimension.


In [9]:
padded_docs, tokenizer = util.preprocessing.tokenize_pad_docs(df, "TITLE")


Vocabulary size: 75286
Number of headlines: 422419


In [10]:
df["TITLE"][1]

'Fed official says weak data caused by weather, should not slow taper'

In [11]:
padded_docs[0]

array([ 215,  452,   25, 1062,   84, 1970,   19, 1081,  270,   37, 1412,
       7900,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0], dtype=int32)
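
To make the tokenize-and-pad step more concrete, here is a simplified pure-Python sketch. It is not the code behind `util.preprocessing.tokenize_pad_docs`: the helper names `fit_word_index` and `encode_and_pad` are hypothetical, words are split on whitespace only, and over-long headlines simply keep their first `max_length` ids.

```python
from collections import Counter


def fit_word_index(texts):
    """Assign integer ids to words, most frequent first; id 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}


def encode_and_pad(text, word_index, max_length=40):
    """Map known words to ids, drop unknown words, then truncate and post-pad with zeros."""
    ids = [word_index[w] for w in text.lower().split() if w in word_index]
    ids = ids[:max_length]
    return ids + [0] * (max_length - len(ids))


word_index = fit_word_index(["fed official says data weak", "fed says taper"])
padded = encode_and_pad("Fed says taper soon", word_index, max_length=6)
print(padded)  # [1, 2, 6, 0, 0, 0]
```

As in the `padded_docs[0]` output above, the word ids come first and the remaining positions are filled with zeros.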

### Import Word Embeddings

To represent our words in numeric form, we'll use pre-trained vector representations for each word in the vocabulary: In this case we'll be using pre-built GloVe word embeddings.

You could also explore training custom, domain-specific word embeddings using SageMaker's built-in [BlazingText algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html). See the official [blazingtext_word2vec_text8 sample](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/introduction_to_amazon_algorithms/blazingtext_word2vec_text8) for an example notebook showing how.


In [12]:
embedding_matrix = util.preprocessing.get_word_embeddings(tokenizer, "data/embeddings")


Downloading Glove word embeddings...
Unzipping...
Loading into memory...
Loaded 400000 word vectors.


In [13]:
np.save(
    file="./data/embeddings/docs-embedding-matrix",
    arr=embedding_matrix,
    allow_pickle=False,
)
vocab_size = embedding_matrix.shape[0]
print(embedding_matrix.shape)


(400000, 100)
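
How `util.preprocessing.get_word_embeddings` assembles its matrix is not shown here. One common approach, sketched below under that caveat, is to give each tokenizer id a row holding the pre-trained vector for that word, leaving all-zero rows for padding and for words missing from the embeddings. The function name `build_embedding_matrix` and the tiny two-dimensional vectors are made up for illustration.

```python
def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with id i; row 0 and unknown words stay zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        vector = vectors.get(word)
        if vector is not None:
            matrix[i] = list(vector)
    return matrix


# Toy stand-ins for a tokenizer vocabulary and pre-trained vectors
toy_word_index = {"fed": 1, "taper": 2, "zzz": 3}
toy_vectors = {"fed": [0.1, 0.2], "taper": [0.3, 0.4]}

toy_matrix = build_embedding_matrix(toy_word_index, toy_vectors, dim=2)
print(toy_matrix)  # [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.0, 0.0]]
```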


### Split Train and Test Sets

Finally we need to divide our data into model training and evaluation sets:


In [14]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    padded_docs,
    encoded_y,
    test_size=0.2,
    random_state=42
)


In [15]:
os.makedirs("./data/train", exist_ok=True)
np.save("./data/train/train_X.npy", X_train)
np.save("./data/train/train_Y.npy", y_train)
os.makedirs("./data/test", exist_ok=True)
np.save("./data/test/test_X.npy", X_test)
np.save("./data/test/test_Y.npy", y_test)


## Upload Data to S3

We'll need to upload our processed data to S3 to make it available for SageMaker training jobs:


In [16]:
s3_bucket = sess.default_bucket()
s3_prefix = "news"


In [17]:
traindata_s3_prefix = f"{s3_prefix}/data/train"
testdata_s3_prefix = f"{s3_prefix}/data/test"
embeddings_s3_prefix = f"{s3_prefix}/data/embeddings"
output_s3 = f"s3://{s3_bucket}/{s3_prefix}/models/"


In [18]:
train_s3 = sess.upload_data(path="./data/train/", bucket=s3_bucket, key_prefix=traindata_s3_prefix)
test_s3 = sess.upload_data(path="./data/test/", bucket=s3_bucket, key_prefix=testdata_s3_prefix)
embeddings_s3 = sess.upload_data(
    # Only send the numpy array of embeddings, not the original files as well:
    path="./data/embeddings/docs-embedding-matrix.npy",
    bucket=s3_bucket,
    key_prefix=embeddings_s3_prefix,
)


In [19]:
inputs = { "train": train_s3, "test": test_s3, "embeddings": embeddings_s3 }
print(inputs)


{'train': 's3://sagemaker-ap-southeast-1-344028372807/news/data/train', 'test': 's3://sagemaker-ap-southeast-1-344028372807/news/data/test', 'embeddings': 's3://sagemaker-ap-southeast-1-344028372807/news/data/embeddings/docs-embedding-matrix.npy'}


## Train with Differentiated Infrastructure on SageMaker

This time, we've packaged the model build and train code from our previous notebook ([Headline Classifier Local.ipynb](Headline%20Classifier%20Local.ipynb)) into the [**main.py**](src/main.py) script in the **src** directory.

We'll use the high-level TensorFlow SDK to train our model on SageMaker.

You can explore the script file for more details on the interface.


### How Amazon SageMaker runs your TensorFlow script with pre-built containers

Amazon SageMaker provides a set of pre-packaged Docker images to help you accelerate building your projects. These images are what drive the SageMaker TensorFlow Estimator. You can use the same TensorFlow image for training and/or hosting. You can find more information at: https://github.com/aws/sagemaker-tensorflow-container , https://sagemaker.readthedocs.io/en/stable/using_tf.html , https://sagemaker.readthedocs.io/en/stable/sagemaker.tensorflow.html .


#### Running your container during training

When Amazon SageMaker runs training, your training script (entry_point input) is run just like a regular Python program. A number of files are laid out for your use, under a `/opt/ml` directory. These will be locations that you can access from within your script. You will see an example of the use of this in our [**main.py**](src/main.py) :

    /opt/ml
    |-- code
    |   `-- <our script(s)>
    |-- input
    |   |-- config
    |   |   |-- hyperparameters.json
    |   |   `-- resourceConfig.json
    |   `-- data
    |       `-- <channel_name>
    |           `-- <input data>
    |-- model
    |   `-- <model files>
    `-- output
        `-- failure

##### The input

* `/opt/ml/input/config` contains information to control how your program runs. `hyperparameters.json` is a JSON-formatted dictionary of hyperparameter names to values. These values will always be strings, so you may need to convert them. `resourceConfig.json` is a JSON-formatted file that describes the network layout used for distributed training. Since we train on a single instance in this notebook, we'll ignore it here.
* `/opt/ml/input/data/<channel_name>/` (for File mode) contains the input data for that channel. The channels are created based on the call to CreateTrainingJob but it's generally important that channels match what the algorithm expects. The files for each channel will be copied from S3 to this directory, preserving the tree structure indicated by the S3 key structure. 
* `/opt/ml/input/data/<channel_name>_<epoch_number>` (for Pipe mode) is the pipe for a given epoch. Epochs start at zero and go up by one each time you read them. There is no limit to the number of epochs that you can run, but you must close each pipe before reading the next epoch.
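
Because the values in `hyperparameters.json` arrive as strings, a script that reads the file directly needs to convert them. The sketch below shows one possible way to do that; the helper name `load_hyperparameters` and the `optimizer` entry are hypothetical, and a temporary directory stands in for `/opt/ml/input/config` so the example can run anywhere.

```python
import json
import os
import tempfile


def load_hyperparameters(config_dir="/opt/ml/input/config"):
    """Read hyperparameters.json and decode each string value where possible."""
    with open(os.path.join(config_dir, "hyperparameters.json")) as f:
        raw = json.load(f)
    parsed = {}
    for name, value in raw.items():
        try:
            parsed[name] = json.loads(value)  # "1" -> 1, "0.05" -> 0.05
        except (TypeError, ValueError):
            parsed[name] = value              # keep plain strings such as "adam"
    return parsed


# Demonstration: a temporary directory plays the role of /opt/ml/input/config
with tempfile.TemporaryDirectory() as demo_config_dir:
    with open(os.path.join(demo_config_dir, "hyperparameters.json"), "w") as f:
        json.dump({"epochs": "1", "vocab_size": "400000", "optimizer": "adam"}, f)
    hps = load_hyperparameters(demo_config_dir)

print(hps)  # {'epochs': 1, 'vocab_size': 400000, 'optimizer': 'adam'}
```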

##### The output

* `/opt/ml/model/` is the directory where you write the model that your algorithm generates. Your model can be in any format that you want. It can be a single file or a whole directory tree. SageMaker will package any files in this directory into a compressed tar archive file. This file will be available at the S3 location returned in the `DescribeTrainingJob` result.
* `/opt/ml/output` is a directory where the algorithm can write a file `failure` that describes why the job failed. The contents of this file will be returned in the `FailureReason` field of the `DescribeTrainingJob` result. For jobs that succeed, there is no reason to write this file as it will be ignored.
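
As an illustration of the `failure` file convention, the sketch below wraps a training function and writes the traceback to `<output_dir>/failure` if it raises. The wrapper name `run_with_failure_report` is an assumption for this example, and the pre-built framework containers may already take care of this for you; a temporary directory stands in for `/opt/ml/output`.

```python
import os
import tempfile
import traceback


def run_with_failure_report(train_fn, output_dir="/opt/ml/output"):
    """Run train_fn; on error, write the traceback to <output_dir>/failure and re-raise."""
    try:
        return train_fn()
    except Exception:
        os.makedirs(output_dir, exist_ok=True)
        with open(os.path.join(output_dir, "failure"), "w") as f:
            f.write(traceback.format_exc())
        raise


def broken_training():
    """Stand-in for a training routine that fails."""
    raise ValueError("training data not found")


demo_output_dir = tempfile.mkdtemp()  # plays the role of /opt/ml/output
failure_text = ""
try:
    run_with_failure_report(broken_training, output_dir=demo_output_dir)
except ValueError:
    with open(os.path.join(demo_output_dir, "failure")) as f:
        failure_text = f.read()

print(failure_text.splitlines()[-1])  # ValueError: training data not found
```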


In [20]:
import sagemaker
from sagemaker.pytorch import PyTorch

Although the script will run on a separate container, we can pass whatever parameters it needs through SageMaker:


In [21]:
hyperparameters = { "epochs": 1, "vocab_size": vocab_size, "num_classes": 4 }
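
On the script side, the SageMaker framework containers pass these hyperparameters to the entry point as command-line arguments and expose data locations through environment variables such as `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR`. The actual `main.py` is not reproduced here; the sketch below only shows how such a script could parse them. The argument names mirror the dictionary above, while `--output_model_dir` is a hypothetical name chosen for this example.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and data locations for a training script (illustrative only)."""
    parser = argparse.ArgumentParser()
    # Hyperparameters: names mirror the keys of the `hyperparameters` dict above
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--vocab_size", type=int, required=True)
    parser.add_argument("--num_classes", type=int, default=4)
    # Locations: fall back to the /opt/ml paths when the SM_* variables are not set
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    parser.add_argument("--test", default=os.environ.get("SM_CHANNEL_TEST", "/opt/ml/input/data/test"))
    parser.add_argument("--output_model_dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    return parser.parse_args(argv)


args = parse_args(["--epochs", "3", "--vocab_size", "400000"])
print(args.epochs, args.vocab_size, args.num_classes)  # 3 400000 4
```

Using `type=int` lets `argparse` handle the string-to-number conversion, so the rest of the script can work with typed values.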


Next we create the estimator object, pass it our hyperparameters, and link our data channels to it. Calling `fit()` then trains the model in a few steps. First, the instance type requested when creating the estimator is provisioned and set up with the appropriate libraries. Then the data from our channels is downloaded onto the instance. Once this is done, the training job begins. The provisioning and data download will take some time, depending on the size of the data.

Once the job has finished, a "Job complete" message will be printed. The trained model can be found in the S3 location configured as the estimator's `output_path` (the session's default bucket if none is given).

To accelerate training, run the job on an `ml.p3.2xlarge` (GPU) instance by setting `train_instance_type` accordingly. If your account runs into resource limits, use an `ml.c5.xlarge` instance instead. Note that the cell below, as last run, uses `train_instance_type="local"`, which runs the training container on the notebook instance itself.

In [120]:
estimator = PyTorch(
    entry_point="main.py",
    source_dir="./src",
    train_instance_count=1,
    train_instance_type="local",
    role=role,
    framework_version="1.6.0",
    py_version="py3",
    hyperparameters=hyperparameters,
    metric_definitions=[
        { "Name": "Epoch", "Regex": "epoch: ([0-9\\.]+)" },
        { "Name": "Train:Loss", "Regex": "train_loss: ([0-9\\.]+)" },
        { "Name": "Validation:Loss", "Regex": "val_loss=([0-9\\.]+)" },
        { "Name": "Validation:Accuracy", "Regex": "val_acc=([0-9\\.]+)" },
    ],
    base_job_name="news-pytorch",
    #train_use_spot_instances=True,       # Use spot instances to reduce cost
    #train_max_run=20*60,                 # Maximum allowed active runtime
    #train_max_wait=30*60,                # Maximum clock time (including spot delays)
)

estimator.fit(inputs)


's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


Creating tmppsvovba5_algo-1-n8109_1 ... done
Attaching to tmppsvovba5_algo-1-n8109_1
algo-1-n8109_1  | 2020-09-20 14:54:09,904 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
algo-1-n8109_1  | 2020-09-20 14:54:09,906 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
algo-1-n8109_1  | 2020-09-20 14:54:09,918 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
algo-1-n8109_1  | 2020-09-20 14:54:09,924 sagemaker_pytorch_container.training INFO     Invoking user training script.
algo-1-n8109_1  | 2020-09-20 14:54:11,084 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
algo-1-n8109_1  | 2020-09-20 14:54:11,098 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
algo-1-n8109_1  | 2020-09-20 14:54:11,112 sagemaker-training-toolkit INFO     No GPU


While the training job is running, take a minute to look at the `main.py` script. You can see how we have adapted our original local code from [Headline Classifier Local.ipynb](Headline%20Classifier%20Local.ipynb) to run on SageMaker.

## Use the Model: Hosting / Inference
Once the training is done, we can deploy the trained model as an Amazon SageMaker real-time hosted endpoint. This will allow us to make predictions (or inference) from the model. Note that we don't have to host on the same type of instance that we used to train. Because endpoint instances stay up and running until the endpoint is deleted, it's advisable to choose a cheaper instance type for inference.


In [101]:
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.t2.medium",
)


Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


-----------------!

### (**JupyterLab Only**) Installing IPyWidgets Extension

This notebook uses a fun little interactive widget to query the classifier, so **ONLY if you're using JupyterLab** (no action needed for plain Jupyter users) you'll need to install an extension to enable it:

- Select "*Settings > Enable Extension Manager (experimental)*" from the toolbar, and confirm to enable it
- Click on the new jigsaw puzzle piece icon in the sidebar, to open the Extension Manager
- Search for `@jupyter-widgets/jupyterlab-manager` (Scroll down - search results show up *below* the list of currently installed widgets!)
- Click "**Install**" below the widget's description
- Wait for the blue progress bar that appears by the search box
- You should be prompted "*A build is needed to include the latest changes*" - select "**Rebuild**"
- The progress bar should resume, and you should shortly see a "Build Complete" dialogue.
- Select "**Reload**" to reload the webpage


### Your model should now be in production as a RESTful API!

Let's evaluate our model with some example headlines...

If you struggle with the widget, you can always simply call the `classify()` function from Python.

You can be creative with your headlines!


In [102]:
from IPython import display
import ipywidgets as widgets
from keras.preprocessing.sequence import pad_sequences

def classify(text):
    """Classify a headline and print the results"""
    encoded_example = tokenizer.texts_to_sequences([text])
    # Pad documents to a max length of 40 words
    max_length = 40
    padded_example = pad_sequences(encoded_example, maxlen=max_length, padding="post")
    result = predictor.predict(padded_example.tolist())
    print(result)
    ix = np.argmax(result["predictions"][0])  # Index of the top class for our single example
    print(f"Predicted class: '{labels[ix]}' with confidence {result['predictions'][0][ix]:.2%}")

interaction = widgets.interact_manual(
    classify,
    text=widgets.Text(
        value="The markets were bullish after news of the merger",
        placeholder="Type a news headline...",
        description="Headline:",
        layout=widgets.Layout(width="99%"),
    )
)
interaction.widget.children[1].description = "Classify!"


interactive(children=(Text(value='The markets were bullish after news of the merger', description='Headline:',…

## Clean up

Unlike training jobs (which destroy their resources as soon as training is finished), real-time endpoint deployments provision instances until we specifically shut the endpoint down...

So let's be frugal with resources, and delete resources when we don't need them anymore:


In [None]:
sess.delete_endpoint(predictor.endpoint)


## (Optional) Automatic Hyperparameter Optimization - HPO

Rather than manually tweak parameters to tune the model performance, we can get SageMaker to help us out.

We'll simply tell SageMaker:

- The type and allowable range of each parameter,
- The metric we want to optimize for, and
- Strategy and resource constraints

...and the service will set up jobs for us to find the best combination.


### (Hyper-)Parameter Definitions


In [None]:
from sagemaker.tuner import CategoricalParameter, ContinuousParameter, HyperparameterTuner, IntegerParameter

hyperparameter_ranges = {
    "epochs": IntegerParameter(2, 7),
    "learning_rate": ContinuousParameter(0.01, 0.2),
}


### Objective Metric

'Metrics' in SageMaker are scraped from the console output of jobs, by way of regular expressions.

We can define multiple metrics to monitor, but HPO requires us to specify that exactly one of them is the **objective** metric to optimize:


In [None]:
metric_definitions = [{ "Name": "loss", "Regex": "loss: ([0-9\\.]+)" }]
objective_metric_name = "loss"
objective_type = "Minimize"
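
To see what the metric pattern above captures, it can be tried locally with Python's `re` module. The log lines below are hypothetical and only serve as an illustration; how SageMaker applies the pattern internally is not covered here, only what the pattern itself matches.

```python
import re

metric_regex = re.compile(r"loss: ([0-9\.]+)")

# Hypothetical log lines, for illustration only
log_lines = [
    "Epoch 1/2",
    "step 100 - loss: 0.6931 - acc: 0.71",
    "step 200 - loss: 0.4200 - val_loss: 0.4800",
]

# Collect every captured group from every line
captured = [float(m) for line in log_lines for m in metric_regex.findall(line)]
print(captured)  # [0.6931, 0.42, 0.48]
```

Note that the unanchored pattern also matches the tail of `val_loss: 0.4800`. If the training script logs several loss values, a more specific pattern (for example one that includes the preceding separator or the full metric name) avoids mixing them into one metric.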


### Start the Tuning Job

We already defined our Estimator above, so we'll just re-use the configuration with minor adjustments.

Note that the Estimator's `hyperparameters` will be used as base values, and overridden by the `HyperparameterTuner` where appropriate.


In [None]:
# Keep per-job resources modest, so that parallel jobs don't hit any limits:
estimator.train_instance_type = "ml.c5.xlarge"
estimator.train_instance_count = 1

In [None]:
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    base_tuning_job_name="news-hpo-keras",
    max_jobs=6,
    max_parallel_jobs=2,
    objective_type=objective_type
)

tuner.fit(inputs)


### Check On Progress

HPO jobs can take a long time to complete and can run multiple training jobs in parallel, each potentially on multiple instances. This is why the `fit()` call above doesn't wait by default and doesn't show us a potentially confusing consolidated log stream.

Go to the Training > Hyperparameter Tuning Jobs page of the [**SageMaker Console**](https://console.aws.amazon.com/sagemaker/home#/hyper-tuning-jobs) and select the job from the list.

You can see all the training jobs triggered for the HPO run, as well as overall summary metrics.

This information can also be accessed via the API/SDKs. For example, we can wait for HPO to finish as follows:


In [None]:
import boto3
import time

# Wait until HPO is finished
hpo_state = "InProgress"
smclient = boto3.Session().client("sagemaker")

while hpo_state == "InProgress":
    hpo_state = smclient.describe_hyper_parameter_tuning_job(
        HyperParameterTuningJobName=tuner.latest_tuning_job.job_name
    )["HyperParameterTuningJobStatus"]
    print("-", end="")
    time.sleep(60)  # Poll once every 1 min

print("\nHPO state:", hpo_state)


### Using the Model

Just like with our `estimator`, we can call `tuner.deploy()` to create an endpoint and `predictor` from the best-performing model found in the HPO run.


## Review

In this notebook, we refactored our local code to train and deploy the same Keras model using SageMaker.

The benefits of this approach are:

- We can automatically provision specialist computing resources (e.g. high-performance, or GPU-accelerated instances) for **only** the duration of the training job: Getting good performance in training, without leaving resources sitting around under-utilized
- Our trained model can be deployed to a secure, production-ready web endpoint with just one SDK call: No container or web application packaging required, unless we want to customize the behaviour
