## **Introduction**

In an investment organization, analysts read news articles to help make investment
decisions. To streamline the selection of articles, a model is needed that extracts
semantic information from the data, identifies similar articles in the set, and
recommends which articles analysts should read next. The model is developed and
evaluated on the 20newsgroups dataset, which contains approximately 20,000 documents
across 20 topics, using Amazon SageMaker, Amazon SageMaker Notebooks,
Amazon SageMaker Built-in Algorithms, the AWS SDK for Python (Boto3), and Amazon S3,
together with WordPress and Amazon Lightsail.

## **Dataset**

The ML model will be trained on the 20newsgroups dataset - a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The 20newsgroups dataset is curated by Carnegie Mellon University School of Computer Science and publicly available from scikit-learn.

## **Process**

The scikit-learn dataset comes already split into a training set and a test set. Using APIs provided by scikit-learn, the headers, footers, and quotes are stripped from each document so that only the body text remains.

As shown below, the entries in the dataset are plain-text paragraphs. They will be processed into a numeric, machine-readable format in the following steps.


In [2]:
pip install "scikit_learn==0.22.2.post1"

Collecting scikit_learn==0.22.2.post1
  Downloading scikit_learn-0.22.2.post1-cp36-cp36m-manylinux1_x86_64.whl (7.1 MB)
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 0.24.1
    Uninstalling scikit-learn-0.24.1:
      Successfully uninstalled scikit-learn-0.24.1
Successfully installed scikit-learn-0.22.2.post1
Note: you may need to restart the kernel to use updated packages.


In [1]:
import numpy as np
import os
import matplotlib.pyplot as plt
import sagemaker
import seaborn as sns
from sklearn.datasets import fetch_20newsgroups
from sklearn.datasets.twenty_newsgroups import strip_newsgroup_header, strip_newsgroup_quoting, strip_newsgroup_footer
newsgroups_train = fetch_20newsgroups(subset='train')['data']
newsgroups_test = fetch_20newsgroups(subset = 'test')['data']
NUM_TOPICS = 30
NUM_NEIGHBORS = 10
BUCKET = 'sagemaker-aw'
PREFIX = '20newsgroups'

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [2]:
for i in range(len(newsgroups_train)):
    newsgroups_train[i] = strip_newsgroup_header(newsgroups_train[i])
    newsgroups_train[i] = strip_newsgroup_quoting(newsgroups_train[i])
    newsgroups_train[i] = strip_newsgroup_footer(newsgroups_train[i])

In [3]:
newsgroups_train[1]

"A fair number of brave souls who upgraded their SI clock oscillator have\nshared their experiences for this poll. Please send a brief message detailing\nyour experiences with the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies are especially requested.\n\nI will be summarizing in the next two days, so please add to the network\nknowledge base if you have done the clock upgrade and haven't answered this\npoll. Thanks."

### Bag of Words

For the data to be machine readable, it must be converted to a numeric format by assigning a token (integer id) to each word. The vocabulary can be limited to 2,000 tokens by counting token frequencies and retaining the 2,000 most frequent. Less frequent words can be dropped because they have little impact on the topic model.

A Bag of Words (BoW) model converts each document into a vector that records the number of times each token appears in it. WordNetLemmatizer, from the nltk package, and CountVectorizer, from scikit-learn, are used to tokenize and count. Lemmatization reduces each word to its lemma, which is an actual dictionary word; by default WordNetLemmatizer treats every word as a noun.

Only tokens that are at least two characters long, start with a lowercase letter, and match the token pattern are kept.
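
As a rough, standard-library-only illustration of the token filter and the "keep the most frequent tokens" idea described above, the sketch below builds count vectors for a made-up two-document corpus. The helper names (`keep_token`, `build_vocab`, `bag_of_words`) and the toy tokens are hypothetical; the notebook itself relies on nltk and CountVectorizer for this step.

```python
import re
from collections import Counter

# Same token pattern as the notebook: words of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def keep_token(token):
    """Keep tokens of length >= 2 that start with a lowercase letter."""
    return (
        len(token) >= 2
        and re.match(r"[a-z]", token) is not None
        and TOKEN_PATTERN.match(token) is not None
    )


def build_vocab(docs, max_size):
    """Return the max_size most frequent kept tokens, sorted alphabetically."""
    totals = Counter(t for doc in docs for t in doc if keep_token(t))
    return sorted(word for word, _ in totals.most_common(max_size))


def bag_of_words(tokens, vocab):
    """Count how often each vocabulary word occurs in one tokenized document."""
    counts = Counter(t for t in tokens if keep_token(t))
    return [counts[word] for word in vocab]


# Hypothetical pre-tokenized documents (illustrative data only).
docs = [
    ["clock", "cpu", "clock", "a", "1.4", "CPU"],
    ["cpu", "clock", "heat", "x"],
]
vocab = build_vocab(docs, max_size=2)
toy_vectors = [bag_of_words(doc, vocab) for doc in docs]
print(vocab)
print(toy_vectors)
```

Tokens such as "a", "1.4", and "CPU" are dropped by the filter, and "heat" is dropped because it falls outside the two most frequent tokens.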


In [4]:
!pip install nltk
import nltk
nltk.download('punkt')
nltk.download('wordnet')
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer 
import re
token_pattern = re.compile(r"(?u)\b\w\w+\b")
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc) if len(t) >= 2 and re.match("[a-z].*",t) 
                and re.match(token_pattern, t)]



[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/ec2-user/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


The CountVectorizer API takes three hyperparameters that control the vocabulary.
* max_features – sets the vocabulary size. Limited to 2000 because a large vocabulary with infrequent words would add noise to the data.
* max_df – ignores words that occur in more than this proportion of documents (0.95 here). This removes extremely frequent words, such as the few that occur in nearly all of the documents.
* min_df – ignores words that occur in fewer than this number of documents (2 here). This ensures that extremely rare words are not included.
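
The effect of the two document-frequency cutoffs can be seen on a tiny corpus. The sketch below is illustrative only: the four toy documents and the variable names are made up, and it uses the default CountVectorizer tokenizer instead of the lemmatizing tokenizer defined earlier.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "apple" is in every document, "date" in only one.
toy_docs = [
    "apple banana",
    "apple cherry",
    "apple banana date",
    "apple cherry",
]

# Same cutoffs as the notebook: drop words in more than 95% of documents
# (max_df as a proportion) and words in fewer than 2 documents (min_df as a count).
toy_vectorizer = CountVectorizer(max_features=2000, max_df=0.95, min_df=2)
toy_counts = toy_vectorizer.fit_transform(toy_docs)

kept_words = sorted(toy_vectorizer.vocabulary_)
print(kept_words)
print(toy_counts.shape)
```

Here "apple" is removed by max_df and "date" by min_df, leaving a two-word vocabulary.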

The BoW vectors are shuffled to generate training and validation sets.


In [5]:
import time
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
vocab_size = 2000
print('Tokenizing and counting, this may take a few minutes...')
start_time = time.time()
vectorizer = CountVectorizer(input='content', analyzer='word', stop_words='english',
                             tokenizer=LemmaTokenizer(), max_features=vocab_size, max_df=0.95, min_df=2)
vectors = vectorizer.fit_transform(newsgroups_train)
vocab_list = vectorizer.get_feature_names()
print('vocab size:', len(vocab_list))

# random shuffle
idx = np.arange(vectors.shape[0])
newidx = np.random.permutation(idx) # this will be the labels fed into the KNN model for training
# Need to store these permutations:

vectors = vectors[newidx]

print('Done. Time elapsed: {:.2f}s'.format(time.time() - start_time))

Tokenizing and counting, this may take a few minutes...




vocab size: 2000
Done. Time elapsed: 38.90s


Because all parameters in the Neural Topic Model (NTM) are of type np.float32, the input data must also be np.float32.

In [6]:
import scipy.sparse as sparse
vectors = sparse.csr_matrix(vectors, dtype=np.float32)
print(type(vectors), vectors.dtype)

<class 'scipy.sparse.csr.csr_matrix'> float32


The data is now split into training data (80%) and validation data (20%).

In [7]:
# Split the shuffled vectors into training and validation data
n_train = int(0.8 * vectors.shape[0])

# First 80% of rows for training, remaining 20% for validation
train_vectors = vectors[:n_train, :]
val_vectors = vectors[n_train:, :]

print(train_vectors.shape,val_vectors.shape)

(9051, 2000) (2263, 2000)


The training, validation, and output paths are defined, and the data is uploaded to an S3 bucket for the model to access during training.
write_spmatrix_to_sparse_tensor is used to convert the SciPy sparse matrix (raw vectors) into RecordIO Protobuf format before uploading to the S3 bucket.

In [8]:
from sagemaker import get_execution_role

role = get_execution_role()

bucket = BUCKET
prefix = PREFIX

train_prefix = os.path.join(prefix, 'train')
val_prefix = os.path.join(prefix, 'val')
output_prefix = os.path.join(prefix, 'output')

s3_train_data = os.path.join('s3://', bucket, train_prefix)
s3_val_data = os.path.join('s3://', bucket, val_prefix)
output_path = os.path.join('s3://', bucket, output_prefix)
print('Training set location', s3_train_data)
print('Validation set location', s3_val_data)
print('Trained model will be saved at', output_path)

Training set location s3://sagemaker-aw/20newsgroups/train
Validation set location s3://sagemaker-aw/20newsgroups/val
Trained model will be saved at s3://sagemaker-aw/20newsgroups/output


In [9]:
def split_convert_upload(sparray, bucket, prefix, fname_template='data_part{}.pbr', n_parts=2):
    import io
    import boto3
    import sagemaker.amazon.common as smac
    
    chunk_size = sparray.shape[0]// n_parts
    for i in range(n_parts):

        # Calculate start and end indices
        start = i*chunk_size
        end = (i+1)*chunk_size
        if i+1 == n_parts:
            end = sparray.shape[0]
        
        # Convert to record protobuf
        buf = io.BytesIO()
        smac.write_spmatrix_to_sparse_tensor(array=sparray[start:end], file=buf, labels=None)
        buf.seek(0)
        
        # Upload to s3 location specified by bucket and prefix
        fname = os.path.join(prefix, fname_template.format(i))
        boto3.resource('s3').Bucket(bucket).Object(fname).upload_fileobj(buf)
        print('Uploaded data to s3://{}'.format(os.path.join(bucket, fname)))
split_convert_upload(train_vectors, bucket=bucket, prefix=train_prefix, fname_template='train_part{}.pbr', n_parts=8)
split_convert_upload(val_vectors, bucket=bucket, prefix=val_prefix, fname_template='val_part{}.pbr', n_parts=1)

Uploaded data to s3://sagemaker-aw/20newsgroups/train/train_part0.pbr
Uploaded data to s3://sagemaker-aw/20newsgroups/train/train_part1.pbr
Uploaded data to s3://sagemaker-aw/20newsgroups/train/train_part2.pbr
Uploaded data to s3://sagemaker-aw/20newsgroups/train/train_part3.pbr
Uploaded data to s3://sagemaker-aw/20newsgroups/train/train_part4.pbr
Uploaded data to s3://sagemaker-aw/20newsgroups/train/train_part5.pbr
Uploaded data to s3://sagemaker-aw/20newsgroups/train/train_part6.pbr
Uploaded data to s3://sagemaker-aw/20newsgroups/train/train_part7.pbr
Uploaded data to s3://sagemaker-aw/20newsgroups/val/val_part0.pbr
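
To make the partitioning explicit, the sketch below reproduces only the index arithmetic of split_convert_upload, without any S3 or SageMaker calls. The helper name `chunk_bounds` is hypothetical; the row counts are taken from the shapes printed earlier.

```python
def chunk_bounds(n_rows, n_parts):
    """Return (start, end) row ranges; the last part absorbs the remainder."""
    chunk_size = n_rows // n_parts
    bounds = []
    for i in range(n_parts):
        start = i * chunk_size
        # The final chunk runs to the end so that no rows are lost to rounding.
        end = n_rows if i + 1 == n_parts else (i + 1) * chunk_size
        bounds.append((start, end))
    return bounds


# 9051 training rows in 8 parts, 2263 validation rows in 1 part.
train_bounds = chunk_bounds(9051, 8)
val_bounds = chunk_bounds(2263, 1)
print(train_bounds)
print(val_bounds)
```

Each of the first seven training parts holds 1131 rows, and the last one holds the remaining 1134.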


To use the built-in SageMaker algorithms, the location of the NTM container in Amazon ECR needs to be specified.

In [13]:
import boto3
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'ntm')

The method get_image_uri has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.


In the call to sagemaker.estimator.Estimator, the type and count of instances are specified for the training job. Since the dataset is small, a CPU instance type is used.

In [14]:
sess = sagemaker.Session()
ntm = sagemaker.estimator.Estimator(container,
                                    role, 
                                    instance_count=2, 
                                    instance_type='ml.c4.xlarge',
                                    output_path=output_path,
                                    sagemaker_session=sess)

Next, the hyperparameters for the topic model are set.
* num_topics – number of topics to be extracted
* feature_dim – set to the vocabulary size
* mini_batch_size – batch size for each worker instance
* epochs – maximum number of epochs to train for
* num_patience_epochs / tolerance – control early stopping. An improvement smaller than tolerance is considered no improvement, and the algorithm stops training if validation loss has not improved within the last num_patience_epochs epochs
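
The interaction between the patience and tolerance settings can be sketched in plain Python. The exact stopping rule inside the NTM container is not shown in this notebook, so the function below is only one plausible reading of the description above; the function name and the validation losses are made up for illustration.

```python
def stopping_epoch(val_losses, patience, tolerance):
    """Return the 1-based epoch at which training would stop.

    Assumption: an epoch counts as an improvement only if it lowers the
    best validation loss seen so far by more than `tolerance`.
    """
    best = float("inf")
    stale = 0  # consecutive epochs without a sufficient improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if best - loss > tolerance:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses)


# Hypothetical validation losses: large gains at first, then a plateau.
losses = [1.0, 0.8, 0.7995, 0.7993, 0.7992, 0.5]
print(stopping_epoch(losses, patience=3, tolerance=0.001))
print(stopping_epoch(losses, patience=5, tolerance=0.001))
```

With a patience of 3 the plateau ends training before the late improvement is seen, while a patience of 5 lets training run through all six epochs.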


In [15]:
ntm.set_hyperparameters(num_topics=NUM_TOPICS, feature_dim=vocab_size, mini_batch_size=128, 
                        epochs=100, num_patience_epochs=5, tolerance=0.001)

So that each worker processes a different portion of the full dataset and contributes different gradients within an epoch, the distribution of the training data channel is set to ShardedByS3Key.

In [16]:
from sagemaker.session import s3_input
s3_train = s3_input(s3_train_data, distribution='ShardedByS3Key') 
ntm.fit({'train': s3_train, 'test': s3_val_data})

The class sagemaker.session.s3_input has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


2021-06-14 05:59:42 Starting - Starting the training job...
2021-06-14 05:59:54 Starting - Launching requested ML instancesProfilerReport-1623650381: InProgress
......
2021-06-14 06:01:05 Starting - Preparing the instances for training......
2021-06-14 06:02:11 Downloading - Downloading input data...
2021-06-14 06:02:39 Training - Downloading the training image...
2021-06-14 06:03:03 Training - Training image download completed. Training in progress.
Docker entrypoint called with argument(s): train
Running default environment configuration script
  from collections import Mapping, MutableMapping, Sequence
[06/14/2021 06:03:04 INFO 140618438534976] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/default-input.json: {'encoder_layers': 'auto', 'mini_batch_size': '256', 'epochs': '50', 'encoder_layers_activation': 'sigmoid', 'optimizer': 'adadelta', 'tolerance': '0.001', 'num_patience_epochs': '3', 'batch_norm': 'false', 'resca

To deploy the model, the deploy method of the sagemaker.estimator.Estimator object is called, specifying the number and type of ML instances that host the endpoint. The deployable model is created and the endpoint is launched.
To run inference, the input payload is serialized as CSV and the inference output is deserialized from JSON.
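
What serialization and deserialization amount to can be shown without an endpoint. The sketch below uses only the standard library; the helper names and the sample response body are hypothetical, with the response shaped like the `predictions` / `topic_weights` structure that the notebook reads from the endpoint.

```python
import json


def to_csv_payload(vector):
    """Serialize one bag-of-words vector as a single CSV row."""
    return ",".join(str(value) for value in vector)


def topic_weights_from_response(body):
    """Parse a JSON response body and return one weight list per prediction."""
    return [p["topic_weights"] for p in json.loads(body)["predictions"]]


# Hypothetical request vector and response body (illustrative data only).
payload = to_csv_payload([0.0, 2.0, 1.0])
response_body = '{"predictions": [{"topic_weights": [0.25, 0.75]}]}'
weights = topic_weights_from_response(response_body)
print(payload)
print(weights)
```

In the notebook, CSVSerializer and JSONDeserializer perform these two steps on behalf of the predictor.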


In [17]:
ntm_predictor = ntm.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

----------------!

In [19]:
from sagemaker.predictor import CSVSerializer, JSONDeserializer

ntm_predictor.serializer = CSVSerializer()
ntm_predictor.deserializer = JSONDeserializer()

In [20]:
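# Labels and category names for the training subset. Assumption: re-fetching
# with the default shuffle seed returns documents in the same order as
# newsgroups_train above.
train_bunch = fetch_20newsgroups(subset='train')
train_labels = train_bunch['target']
categories = train_bunch['target_names']

# Query the NTM endpoint one document at a time and collect the topic weights
predictions = []
for item in np.array(vectors.todense()):
    results = ntm_predictor.predict(item)
    predictions.append(np.array([prediction['topic_weights'] for prediction in results['predictions']]))

predictions = np.array([np.ndarray.flatten(x) for x in predictions])
# vectors was shuffled with newidx, so the labels are reordered the same way
topicvec = train_labels[newidx]
topicnames = [categories[x] for x in topicvec]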

A dictionary mapping each document's original index (used as its k-NN label) to its position in the shuffled data is created, and the training data is stored in the S3 bucket.

In [22]:
labels = newidx 
labeldict = dict(zip(newidx,idx))
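
How this dictionary relates original and shuffled positions is easiest to see on a handful of items. The sketch below uses plain lists and a fixed, made-up permutation in place of np.random.permutation; all names are hypothetical.

```python
# Four hypothetical documents in their original order.
articles = ["doc A", "doc B", "doc C", "doc D"]

toy_idx = list(range(len(articles)))   # positions 0..3
toy_newidx = [2, 0, 3, 1]              # fixed stand-in for a random permutation

# Shuffling: row i of the shuffled data is the document originally at toy_newidx[i].
shuffled_articles = [articles[i] for i in toy_newidx]

# Same construction as labeldict: original index -> row in the shuffled data.
toy_labeldict = dict(zip(toy_newidx, toy_idx))

original_index = 3
row = toy_labeldict[original_index]
print(row, shuffled_articles[row])
```

Looking up an original index returns the shuffled row holding that document, so a k-NN label can be traced back to a row of the training matrix.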

In [23]:
import io
import sagemaker.amazon.common as smac


print('train_features shape = ', predictions.shape)
print('train_labels shape = ', labels.shape)
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, predictions, labels)
buf.seek(0)

bucket = BUCKET
prefix = PREFIX
key = 'knn/train'
fname = os.path.join(prefix, key)
print(fname)
boto3.resource('s3').Bucket(bucket).Object(fname).upload_fileobj(buf)
s3_train_data = 's3://{}/{}/{}'.format(bucket, prefix, key)
print('uploaded training data location: {}'.format(s3_train_data))

train_features shape =  (11314, 30)
train_labels shape =  (11314,)
20newsgroups/knn/train
uploaded training data location: s3://sagemaker-aw/20newsgroups/knn/train


A helper function is used to create a k-NN estimator with index_metric set to COSINE, so that cosine similarity is used for computing the nearest neighbors.
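
For intuition about what a cosine-based neighbor search computes, the following NumPy sketch ranks a few made-up topic vectors against a query. It is a brute-force illustration under the assumption that cosine distance is 1 minus cosine similarity; it is not the index that the SageMaker k-NN algorithm builds, and the function name and data are hypothetical.

```python
import numpy as np


def cosine_nearest(query, items, k):
    """Return indices and cosine distances of the k items closest to query."""
    q = query / np.linalg.norm(query)
    m = items / np.linalg.norm(items, axis=1, keepdims=True)
    similarities = m @ q                  # cosine similarity per row
    order = np.argsort(-similarities)[:k]  # most similar first
    return order, 1.0 - similarities[order]


# Hypothetical topic-weight vectors for four articles (illustrative data only).
topic_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
neighbors, distances = cosine_nearest(np.array([1.0, 0.0, 0.0]), topic_vectors, k=2)
print(neighbors, distances)
```

Articles whose topic mixtures point in nearly the same direction as the query end up with the smallest distances, which is the basis for recommending similar articles.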

In [24]:
def trained_estimator_from_hyperparams(s3_train_data, hyperparams, output_path, s3_test_data=None):
    """
    Create an Estimator from the given hyperparams, fit it to the training data,
    and return the trained estimator.
    """
    # set up the estimator
    knn = sagemaker.estimator.Estimator(get_image_uri(boto3.Session().region_name, "knn"),
        get_execution_role(),
        train_instance_count=1,
        train_instance_type='ml.c4.xlarge',
        output_path=output_path,
        sagemaker_session=sagemaker.Session())
    knn.set_hyperparameters(**hyperparams)
    
    # train a model. fit_input contains the location of the training data
    fit_input = {'train': s3_train_data}
    knn.fit(fit_input)
    return knn

hyperparams = {
    'feature_dim': predictions.shape[1],
    'k': NUM_NEIGHBORS,
    'sample_size': predictions.shape[0],
    'predictor_type': 'classifier' ,
    'index_metric':'COSINE'
}
output_path = 's3://' + bucket + '/' + prefix + '/knn/output'
knn_estimator = trained_estimator_from_hyperparams(s3_train_data, hyperparams, output_path)

The method get_image_uri has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.
train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


2021-06-14 06:42:03 Starting - Starting the training job...
2021-06-14 06:42:27 Starting - Launching requested ML instancesProfilerReport-1623652923: InProgress
......
2021-06-14 06:43:27 Starting - Preparing the instances for training......
2021-06-14 06:44:27 Downloading - Downloading input data
2021-06-14 06:44:27 Training - Downloading the training image..............
Docker entrypoint called with argument(s): train
Running default environment configuration script
[06/14/2021 06:46:43 INFO 139840775472960] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/resources/default-conf.json: {'_kvstore': 'dist_async', '_log_level': 'info', '_num_gpus': 'auto', '_num_kv_servers': '1', '_tuning_objective_metric': '', '_faiss_index_nprobe': '5', 'epochs': '1', 'feature_dim': 'auto', 'faiss_index_ivf_nlists': 'auto', 'index_metric': 'L2', 'index_type': 'faiss.Flat', 'mini_batch_size': '5000', '_enable_profiler': 'false'}
[06/14/

An endpoint is launched for the k-NN model, which will return all the cosine distances. The test data is preprocessed to run inference.