# Document Embedding with Amazon SageMaker Object2Vec

1. [Introduction](#Introduction)
2. [Background](#Background)
  1. [Embedding documents using Object2Vec](#Embedding-documents-using-Object2Vec)
3. [Download and preprocess Wikipedia data](#Download-and-preprocess-Wikipedia-data)
  1. [Install and load dependencies](#Install-and-load-dependencies)
  2. [Build vocabulary and tokenize datasets](#Build-vocabulary-and-tokenize-datasets)
  3. [Upload preprocessed data to S3](#Upload-preprocessed-data-to-S3)
4. [Define SageMaker session, Object2Vec image, S3 input and output paths](#Define-SageMaker-session,-Object2Vec-image,-S3-input-and-output-paths)
5. [Train and deploy doc2vec](#Train-and-deploy-doc2vec)
  1. [Learning performance boost with new features](#Learning-performance-boost-with-new-features)
  2. [Training speedup with sparse gradient update](#Training-speedup-with-sparse-gradient-update)
6. [Apply learned embeddings to document retrieval task](#Apply-learned-embeddings-to-document-retrieval-task)
  1. [Comparison with the StarSpace algorithm](#Comparison-with-the-StarSpace-algorithm)

## Introduction

In this notebook, we introduce four new features to Object2Vec, a general-purpose neural embedding algorithm: negative sampling, sparse gradient update, weight-sharing, and comparator operator customization. The new features together broaden the applicability of Object2Vec, improve its training speed and accuracy, and provide users with greater flexibility. See [Introduction to the Amazon SageMaker Object2Vec](https://aws.amazon.com/blogs/machine-learning/introduction-to-amazon-sagemaker-object2vec/) if you aren’t already familiar with Object2Vec.

We demonstrate how these new features extend the applicability of Object2Vec to a new document embedding use case. A customer has a large collection of documents. Instead of storing these documents in their raw format or as sparse bag-of-words vectors, she would like to embed all documents in a common low-dimensional space so that the semantic distances between documents are preserved, which makes training in the various downstream tasks more efficient.

## Background

Object2Vec is a highly customizable multi-purpose algorithm that can learn embeddings of pairs of objects. The embeddings are learned such that they preserve the objects' pairwise similarities in the original space.

- Similarity is user-defined: users need to provide the algorithm with pairs of objects that they define as similar (1) or dissimilar (0); alternatively, the users can define similarity in a continuous sense (provide a real-valued similarity score).

- The learned embeddings can be used to efficiently compute nearest neighbors of objects, as well as to visualize natural clusters of related objects in the embedding space. In addition, the embeddings can also be used as features of the corresponding objects in downstream supervised tasks such as classification or regression.
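As a minimal sketch of what such user-defined pairs can look like, the snippet below serializes a few labeled pairs as JSON Lines records with the `in0`, `in1`, and `label` keys that this notebook writes later on. The token IDs and the similarity score are made up for illustration, and `to_jsonl` is a hypothetical helper, not part of any SageMaker API.

```python
import json


def to_jsonl(pairs):
    """Serialize (in0, in1, label) triples as JSON Lines text."""
    lines = []
    for in0, in1, label in pairs:
        lines.append(json.dumps({"in0": in0, "in1": in1, "label": label}))
    return "\n".join(lines) + "\n"


# Binary similarity: 1 = similar, 0 = dissimilar (token IDs are illustrative).
binary_pairs = [
    ([4, 17, 9], [4, 17, 9, 23, 8], 1),
    ([4, 17, 9], [51, 2, 77], 0),
]

# Continuous similarity: a real-valued score instead of a class label.
scored_pairs = [([4, 17, 9], [4, 17, 23], 0.8)]

binary_jsonl = to_jsonl(binary_pairs)
scored_jsonl = to_jsonl(scored_pairs)
print(binary_jsonl, end="")
print(scored_jsonl, end="")
```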

### Embedding documents using Object2Vec

We demonstrate how, with the new features, Object2Vec can be used to embed a large collection of documents into vectors in the same latent space.

Similar to the widely used Word2Vec algorithm for word embedding, a natural approach to document embedding is to preprocess documents as (sentence, context) pairs, where the sentence and its matching context come from the same document. The matching context is the entire document with the given sentence removed. The idea is to embed both sentence and context into a low-dimensional space such that their mutual similarity is maximized, since they belong to the same document and therefore should be semantically related. The learned encoder for the context can then be used to encode new documents into the same embedding space.

To train the encoders for sentences and documents, we also need negative (sentence, context) pairs so that the model can learn to discriminate between semantically similar and dissimilar pairs. It is easy to generate such negatives by pairing sentences with documents that they do not belong to. Since there are many more negative pairs than positive ones in naturally occurring data, we typically resort to random sampling to balance positive and negative pairs in the training data. The figure below shows how the positive and negative pairs are generated from unlabeled data for the purpose of learning embeddings for documents (and sentences).
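The pair construction described above can be sketched in a few lines of plain Python. This is an illustration under simple assumptions (documents are lists of sentence strings, negatives are drawn uniformly from the other documents); `make_pairs` and the toy documents are hypothetical. In this notebook the negatives are sampled by Object2Vec itself during training, so only the positive half of this sketch corresponds to the preprocessing code further down.

```python
import random


def make_pairs(documents, negatives_per_positive=1, seed=0):
    """Build (sentence, context, label) triples from a list of documents.

    documents: list of documents, each a list of sentence strings.
    """
    rng = random.Random(seed)
    pairs = []
    for doc_idx, sentences in enumerate(documents):
        if len(sentences) < 2:
            continue  # need at least one sentence left over as context
        sent_idx = rng.randrange(len(sentences))
        sentence = sentences[sent_idx]
        # Positive pair: the document with the chosen sentence removed.
        context = sentences[:sent_idx] + sentences[sent_idx + 1:]
        pairs.append((sentence, context, 1))
        # Negative pairs: the same sentence with documents it does not belong to.
        other_docs = [i for i in range(len(documents)) if i != doc_idx]
        k = min(negatives_per_positive, len(other_docs))
        for neg_idx in rng.sample(other_docs, k):
            pairs.append((sentence, documents[neg_idx], 0))
    return pairs


docs = [
    ["cats purr", "cats chase mice", "cats sleep a lot"],
    ["rust is a language", "rust has a borrow checker"],
    ["udon is a noodle", "udon is served hot or cold"],
]
pairs = make_pairs(docs, negatives_per_positive=2)
for sentence, context, label in pairs:
    print(label, repr(sentence), context)
```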

We show how Object2Vec with the new *negative sampling feature* can be applied to the document embedding use-case. In addition, we show how the other new features, namely, *weight-sharing*, *customization of comparator operator*, and *sparse gradient update*, together enhance the algorithm's performance and user-experience in and beyond this use-case. Sections [Learning performance boost with new features](#Learning-performance-boost-with-new-features) and [Training speedup with sparse gradient update](#Training-speedup-with-sparse-gradient-update) in this notebook provide a detailed introduction to the new features.

## Download and preprocess Wikipedia data

Please be aware of the following requirements about the acknowledgment, copyright and availability, cited from the [data source description page](https://github.com/facebookresearch/StarSpace/blob/master/LICENSE.md).

> Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

In [1]:
%%bash

DATANAME="wikipedia"
DATADIR="/tmp/wiki"

mkdir -p "${DATADIR}"

# Skip the download if the extracted file is already present.
if [ ! -f "${DATADIR}/ja.wikipedia_250k.txt" ]
then
    echo "Downloading wikipedia data"
    wget --quiet -c "https://s3-ap-northeast-1.amazonaws.com/dev.tech-sketch.jp/chakki/public/ja.wikipedia_250k.zip" -O "${DATADIR}/${DATANAME}_train.zip"
    unzip "${DATADIR}/${DATANAME}_train.zip" -d "${DATADIR}"
fi


Downloading wikipedia data
Archive:  /tmp/wiki/wikipedia_train.zip
  inflating: /tmp/wiki/ja.wikipedia_250k.txt  


In [27]:
datadir = '/tmp/wiki'

In [3]:
!ls /tmp/wiki

ja.wikipedia_250k.txt  wikipedia_train.zip


### Install and load dependencies

In [4]:
!pip install keras tensorflow

Collecting keras
  Downloading https://files.pythonhosted.org/packages/5e/10/aa32dad071ce52b5502266b5c659451cfd6ffcbf14e6c8c4f16c0ff5aaab/Keras-2.2.4-py2.py3-none-any.whl (312kB)
Collecting tensorflow
  Downloading https://files.pythonhosted.org/packages/77/63/a9fa76de8dffe7455304c4ed635be4aa9c0bacef6e0633d87d5f54530c5c/tensorflow-1.13.1-cp36-cp36m-manylinux1_x86_64.whl (92.5MB)
Collecting keras-applications>=1.0.6 (from keras)
  Downloading https://files.pythonhosted.org/packages/90/85/64c82949765cfb246bbdaf5aca2d55f400f792655927a017710a78445def/Keras_Applications-1.0.7-py2.py3-none-any.whl (51kB)
Collecting keras-preprocessing>=1.0.5 (from keras)
  Downloading https://files.pythonhosted.org/packages/c0/bf/0315ef6a9fd3fc2346e85b0ff1f5f83ca1

In [5]:
import json
import os
import random
from itertools import chain
from keras.preprocessing.text import Tokenizer
from sklearn.preprocessing import normalize

## sagemaker api
import sagemaker, boto3
from sagemaker.session import s3_input
from sagemaker.predictor import json_serializer, json_deserializer

Using TensorFlow backend.


### Build vocabulary and tokenize datasets

In [26]:
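def load_articles(filepath):
    # Each line is one article: tab-separated sentences of space-separated tokens.
    with open(filepath, encoding='utf-8') as f:  # assumes the corpus is UTF-8 encoded
        for line in f:
            yield map(str.split, line.strip().split('\t'))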


def split_sents(article):
    return [sent.split(' ') for sent in article.split('\t')]


def build_vocab(sents):
    print('Build start...')
    tok = Tokenizer(oov_token='<UNK>', filters='')
    tok.fit_on_texts(sents)
    print('Build end...')
    return tok


def generate_positive_pairs_from_single_article(sents, tokenizer):
    sents = list(sents)
    idx = random.randrange(0, len(sents))
    center = sents.pop(idx)
    wrapper_tokens = tokenizer.texts_to_sequences(sents)
    sent_tokens = tokenizer.texts_to_sequences([center])
    wrapper_tokens = list(chain(*wrapper_tokens))
    sent_tokens = list(chain(*sent_tokens))
    yield {'in0': sent_tokens, 'in1': wrapper_tokens, 'label': 1}


def generate_positive_pairs_from_single_file(sents_per_article, tokenizer):
    iter_list = [generate_positive_pairs_from_single_article(sents, tokenizer)
                 for sents in sents_per_article
                 ]
    return chain.from_iterable(iter_list)


In [28]:
filepath = os.path.join(datadir, 'ja.wikipedia_250k.txt')
sents_per_article =  load_articles(filepath)
# Flatten lazily so the whole corpus is not unpacked into memory at once
sents = chain.from_iterable(sents_per_article)
tokenizer = build_vocab(sents)

# save
datadir = '.'
train_prefix = 'train250k'
fname = "wikipedia_{}.txt".format(train_prefix)
outfname = os.path.join(datadir, '{}_tokenized.jsonl'.format(train_prefix))
with open(outfname, 'w') as f:
    sents_per_article =  load_articles(filepath)
    for sample in generate_positive_pairs_from_single_file(sents_per_article, tokenizer):
        f.write('{}\n'.format(json.dumps(sample)))

Build start...
Build end...
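Before training, it can be useful to check how long the tokenized sequences are compared with the maximum sequence length passed to the encoders (this notebook later sets `enc0_max_seq_len` to 50). The following is a rough sketch, not part of the original pipeline: `length_summary` is a hypothetical helper and the sample records are synthetic.

```python
import json


def length_summary(jsonl_lines, max_seq_len=50):
    """Return, per field, the longest sequence and how many exceed max_seq_len."""
    stats = {"in0": {"max": 0, "over": 0}, "in1": {"max": 0, "over": 0}}
    total = 0
    for line in jsonl_lines:
        record = json.loads(line)
        total += 1
        for field in ("in0", "in1"):
            n = len(record[field])
            stats[field]["max"] = max(stats[field]["max"], n)
            if n > max_seq_len:
                stats[field]["over"] += 1
    for field in stats:
        stats[field]["share_over"] = stats[field]["over"] / total if total else 0.0
    return stats


# Synthetic records with the same keys as the file written above.
sample_lines = [
    json.dumps({"in0": [1, 2, 3], "in1": list(range(80)), "label": 1}),
    json.dumps({"in0": [4, 5], "in1": list(range(10)), "label": 1}),
]
summary = length_summary(sample_lines, max_seq_len=50)
print(summary)
```

To run this on the real data, pass an open file handle for the tokenized file instead of `sample_lines`, since the function only iterates over lines.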


In [29]:
# Shuffle training data
!shuf {outfname} > {train_prefix}_tokenized_shuf.jsonl
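The `shuf` call above produces a different order on every run. If a reproducible shuffle is wanted, or `shuf` is not available, a seeded Python equivalent can be sketched as follows. `shuffle_lines` is a hypothetical helper; it reads the whole file into memory, which is an assumption that may not hold for very large corpora.

```python
import os
import random
import tempfile


def shuffle_lines(src_path, dst_path, seed=42):
    """Shuffle the lines of src_path into dst_path with a fixed seed."""
    with open(src_path, encoding="utf-8") as f:
        lines = f.readlines()
    random.Random(seed).shuffle(lines)
    with open(dst_path, "w", encoding="utf-8") as f:
        f.writelines(lines)
    return len(lines)


# Small demonstration on a temporary file.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "toy.jsonl")
dst = os.path.join(tmpdir, "toy_shuf.jsonl")
with open(src, "w", encoding="utf-8") as f:
    f.writelines('{{"id": {}}}\n'.format(i) for i in range(10))
n_lines = shuffle_lines(src, dst)
print(n_lines, "lines shuffled")
```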

### Upload preprocessed data to S3

In [30]:
TRAIN_DATA="train250k_tokenized_shuf.jsonl"

# NOTE: define your s3 bucket and key here
S3_BUCKET = 'YOUR_BUCKET'
S3_KEY = 'object2vec-doc2vec'



In [31]:
%%bash -s "$TRAIN_DATA" "$S3_BUCKET" "$S3_KEY"

aws s3 cp "$1" s3://$2/$3/input/train/

Completed 3.8 MiB/531.9 MiB (16.7 MiB/s) with 1 file(s) remain

## Define SageMaker session, Object2Vec image, S3 input and output paths

In [32]:
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri


region = boto3.Session().region_name
print("Your notebook is running on region '{}'".format(region))

sess = sagemaker.Session()

 
role = get_execution_role()
print("Your IAM role: '{}'".format(role))

container = get_image_uri(region, 'object2vec')
print("The image uri used is '{}'".format(container))

print("Using s3 bucket: {} and key prefix: {}".format(S3_BUCKET, S3_KEY))

Your notebook is running on region 'us-east-1'
Your IAM role: 'arn:aws:iam::937428301455:role/AmazonSageMaker-ExecutionRole-20180627'
The image uri used is '382416733822.dkr.ecr.us-east-1.amazonaws.com/object2vec:1'
Using s3 bucket: sagemaker.tech-sketch.jp and key prefix: object2vec-doc2vec


In [33]:
## define input channels

s3_input_path = os.path.join('s3://', S3_BUCKET, S3_KEY, 'input')

s3_train = s3_input(os.path.join(s3_input_path, 'train', TRAIN_DATA), 
                    distribution='ShardedByS3Key', content_type='application/jsonlines')

# Optional validation and test channels, left disabled in this notebook
# (DEV_DATA and TEST_DATA are not defined above):
# s3_valid = s3_input(os.path.join(s3_input_path, 'validation', DEV_DATA),
#                     distribution='ShardedByS3Key', content_type='application/jsonlines')
#
# s3_test = s3_input(os.path.join(s3_input_path, 'test', TEST_DATA),
#                    distribution='ShardedByS3Key', content_type='application/jsonlines')

In [34]:
## define output path
output_path = os.path.join('s3://', S3_BUCKET, S3_KEY, 'models')
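The cells above build S3 URIs with `os.path.join`, which uses the separator of the local operating system. That works on the Linux hosts behind SageMaker notebooks, but would produce backslashes on Windows. A platform-independent alternative could look like the sketch below; `s3_uri` is a hypothetical helper, not a SageMaker SDK function.

```python
def s3_uri(bucket, *parts):
    """Join a bucket and key parts into an s3:// URI using forward slashes."""
    key = "/".join(p.strip("/") for p in parts if p)
    if key:
        return "s3://{}/{}".format(bucket, key)
    return "s3://{}".format(bucket)


output_uri = s3_uri("YOUR_BUCKET", "object2vec-doc2vec", "models")
print(output_uri)  # s3://YOUR_BUCKET/object2vec-doc2vec/models
```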

## Train and deploy doc2vec

We combine the four new features in our training of Object2Vec:

- Negative sampling: With the new `negative_sampling_rate` hyperparameter, users of Object2Vec only need to provide positively labeled data pairs; the algorithm automatically samples negative pairs internally during training.

- Weight-sharing of embedding layer: The new `tied_token_embedding_weight` hyperparameter gives users the flexibility to share the token embedding weights between both encoders, which improves the performance of the algorithm in this use case.

- Customization of comparator operator: The new `comparator_list` hyperparameter gives users the flexibility to mix and match different operators so that they can tune the algorithm towards optimal performance for their applications.

- Sparse gradient update: Setting the new `token_embedding_storage_type` hyperparameter to `row_sparse` makes the optimizer update only the rows of the token embedding matrix that occur in the current mini-batch, which speeds up training.
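To make the comparator operators concrete, here is a plain-Python sketch of the three operators named in the algorithm's default `comparator_list` (`hadamard`, `concat`, `abs_diff`), applied to two small encoder outputs and concatenated in the order given. It is only an illustration of the arithmetic, not the algorithm's actual implementation; `compare` and the example vectors are hypothetical.

```python
def hadamard(u, v):
    """Element-wise product of two vectors."""
    return [a * b for a, b in zip(u, v)]


def concat(u, v):
    """Concatenation of two vectors."""
    return list(u) + list(v)


def abs_diff(u, v):
    """Element-wise absolute difference of two vectors."""
    return [abs(a - b) for a, b in zip(u, v)]


COMPARATORS = {"hadamard": hadamard, "concat": concat, "abs_diff": abs_diff}


def compare(u, v, comparator_list):
    """Apply each named operator in order and concatenate the results."""
    names = [name.strip() for name in comparator_list.split(",")]
    features = []
    for name in names:
        features.extend(COMPARATORS[name](u, v))
    return features


enc0_out = [1.0, -2.0, 0.5]
enc1_out = [0.5, 4.0, 0.5]
print(compare(enc0_out, enc1_out, "hadamard"))
print(compare(enc0_out, enc1_out, "hadamard, concat, abs_diff"))
```

With `"hadamard"` alone, as used in this notebook, the comparison vector has the same dimension as the encoder output; adding operators makes the input to the downstream layers wider.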

In [39]:
# Define training hyperparameters

hyperparameters = {
      "_kvstore": "device",
      "_num_gpus": 'auto',
      "_num_kv_servers": "auto",
      "bucket_width": 0,
      "dropout": 0.4,
      "early_stopping_patience": 2,
      "early_stopping_tolerance": 0.01,
      "enc0_layers": "auto",
      "enc0_max_seq_len": 50,
      "enc0_network": "pooled_embedding",
      "enc0_pretrained_embedding_file": "",
      "enc0_token_embedding_dim": 300,
      "enc0_vocab_size": len(tokenizer.word_index) + 1,
      "enc1_network": "enc0",
      "enc_dim": 300,
      "epochs": 20,
      "learning_rate": 0.01,
      "mini_batch_size": 512,
      "mlp_activation": "relu",
      "mlp_dim": 512,
      "mlp_layers": 2,
      "num_classes": 2,
      "optimizer": "adam",
      "output_layer": "softmax",
      "weight_decay": 0
}


hyperparameters['negative_sampling_rate'] = 3
hyperparameters['tied_token_embedding_weight'] = "true"
hyperparameters['comparator_list'] = "hadamard"
hyperparameters['token_embedding_storage_type'] = 'row_sparse'

    
# get estimator
doc2vec = sagemaker.estimator.Estimator(container,
                                          role, 
                                          train_instance_count=1, 
                                          train_instance_type='ml.p2.xlarge',
                                          output_path=output_path,
                                          sagemaker_session=sess)



In [40]:
# set hyperparameters
doc2vec.set_hyperparameters(**hyperparameters)

# fit estimator with data
doc2vec.fit({'train': s3_train})
#doc2vec.fit({'train': s3_train, 'validation':s3_valid, 'test':s3_test})

2019-05-10 02:17:43 Starting - Starting the training job...
2019-05-10 02:17:44 Starting - Launching requested ML instances......
2019-05-10 02:18:55 Starting - Preparing the instances for training......
2019-05-10 02:20:09 Downloading - Downloading input data
2019-05-10 02:20:09 Training - Downloading the training image......
2019-05-10 02:21:01 Training - Training image download completed. Training in progress.
Docker entrypoint called with argument(s): train
[05/10/2019 02:21:04 INFO 140711520630592] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/default-input.json: {u'comparator_list': u'hadamard, concat, abs_diff', u'enc0_vocab_file': u'', u'output_layer': u'softmax', u'enc0_cnn_filter_width': 3, u'epochs': 30, u'mlp_dim': 512, u'enc0_freeze_pretrained_embedding': u'true', u'mlp_layers': 2, u'_num_kv_servers': u'auto', u'weight_decay': 0, u'enc0_pretrained_embedding_file': u'', u'token_embedding_storage_type': u'dense', u'enc0_to

[05/10/2019 02:22:44 INFO 140711520630592] Epoch: 0, batches: 100, num_examples: 51200, 13140.5 samples/sec, epoch time so far: 0:00:03.896361
[05/10/2019 02:22:44 INFO 140711520630592]     Training metrics: perplexity: 1.683 cross_entropy: 0.521 accuracy: 0.753
[05/10/2019 02:22:47 INFO 140711520630592] Epoch: 0, batches: 200, num_examples: 102400, 13826.4 samples/sec, epoch time so far: 0:00:07.406123
[05/10/2019 02:22:47 INFO 140711520630592]     Training metrics: perplexity: 1.583 cross_entropy: 0.459 accuracy: 0.784
[05/10/2019 02:22:51 INFO 140711520630592] Epoch: 0, batches: 300, num_examples: 153600, 14048.9 samples/sec, epoch time so far: 0:00:10.933244
[05/10/2019 02:22:51 INFO 140711520630592]     Training metrics: perplexity: 1.529 cross_entropy: 0.425 accuracy: 0.803
[05/10/2019 02:22:55 INFO 140711520630592] Epoch: 0, batches: 400, num_examples: 204800, 14173.1 samples/sec, epoch time so far: 0:00:14.449890

[05/10/2019 02:24:15 INFO 140711520630592] Epoch: 1, batches: 900, num_examples: 460800, 14545.2 samples/sec, epoch time so far: 0:00:31.680476
[05/10/2019 02:24:15 INFO 140711520630592]     Training metrics: perplexity: 1.222 cross_entropy: 0.201 accuracy: 0.921
[05/10/2019 02:24:19 INFO 140711520630592] Epoch: 1, batches: 1000, num_examples: 512000, 14550.9 samples/sec, epoch time so far: 0:00:35.186891
[05/10/2019 02:24:19 INFO 140711520630592]     Training metrics: perplexity: 1.222 cross_entropy: 0.201 accuracy: 0.921
[05/10/2019 02:24:22 INFO 140711520630592] Epoch: 1, batches: 1100, num_examples: 563200, 14560.4 samples/sec, epoch time so far: 0:00:38.680367
[05/10/2019 02:24:22 INFO 140711520630592]     Training metrics: perplexity: 1.223 cross_entropy: 0.201 accuracy: 0.921
[05/10/2019 02:24:26 INFO 140711520630592] Epoch: 1, batches: 1200, num_examples: 614400, 14555.2 samples/sec, epoch time so far: 0:00:42.211776

[05/10/2019 02:25:46 INFO 140711520630592] Epoch: 2, batches: 1700, num_examples: 870400, 14561.2 samples/sec, epoch time so far: 0:00:59.775215
[05/10/2019 02:25:46 INFO 140711520630592]     Training metrics: perplexity: 1.177 cross_entropy: 0.163 accuracy: 0.938
[05/10/2019 02:25:48 INFO 140711520630592] **************
[05/10/2019 02:25:48 INFO 140711520630592] Completed Epoch: 2, time taken: 0:01:01.591944
[05/10/2019 02:25:48 INFO 140711520630592] Epoch 2 Training metrics:   perplexity: 1.177 cross_entropy: 0.163 accuracy: 0.938
[05/10/2019 02:25:48 INFO 140711520630592] #quality_metric: host=algo-1, epoch=2, train cross_entropy <loss>=0.162737844661
[05/10/2019 02:25:48 INFO 140711520630592] #quality_metric: host=algo-1, epoch=2, train accuracy <score>=0.937661645619
[05/10/2019 02:25:48 INFO 140711520630592] **************
[05/10/2019 02:25:48 INFO 140711520630592] patience losses: [0.31073039042533912

[05/10/2019 02:26:55 INFO 140711520630592] Epoch: 4, batches: 100, num_examples: 51200, 14678.1 samples/sec, epoch time so far: 0:00:03.488199
[05/10/2019 02:26:55 INFO 140711520630592]     Training metrics: perplexity: 1.128 cross_entropy: 0.120 accuracy: 0.955
[05/10/2019 02:26:58 INFO 140711520630592] Epoch: 4, batches: 200, num_examples: 102400, 14591.7 samples/sec, epoch time so far: 0:00:07.017687
[05/10/2019 02:26:58 INFO 140711520630592]     Training metrics: perplexity: 1.128 cross_entropy: 0.121 accuracy: 0.955
[05/10/2019 02:27:02 INFO 140711520630592] Epoch: 4, batches: 300, num_examples: 153600, 14560.7 samples/sec, epoch time so far: 0:00:10.548945
[05/10/2019 02:27:02 INFO 140711520630592]     Training metrics: perplexity: 1.129 cross_entropy: 0.121 accuracy: 0.955
[05/10/2019 02:27:05 INFO 140711520630592] Epoch: 4, batches: 400, num_examples: 204800, 14544.2 samples/sec, epoch time so far: 0:00:14.081252

[05/10/2019 02:28:18 INFO 140711520630592] Epoch: 5, batches: 700, num_examples: 358400, 14551.0 samples/sec, epoch time so far: 0:00:24.630578
[05/10/2019 02:28:18 INFO 140711520630592]     Training metrics: perplexity: 1.123 cross_entropy: 0.116 accuracy: 0.958
[05/10/2019 02:28:22 INFO 140711520630592] Epoch: 5, batches: 800, num_examples: 409600, 14565.4 samples/sec, epoch time so far: 0:00:28.121504
[05/10/2019 02:28:22 INFO 140711520630592]     Training metrics: perplexity: 1.123 cross_entropy: 0.116 accuracy: 0.958
[05/10/2019 02:28:25 INFO 140711520630592] Epoch: 5, batches: 900, num_examples: 460800, 14566.9 samples/sec, epoch time so far: 0:00:31.633262
[05/10/2019 02:28:25 INFO 140711520630592]     Training metrics: perplexity: 1.124 cross_entropy: 0.117 accuracy: 0.958
[05/10/2019 02:28:29 INFO 140711520630592] Epoch: 5, batches: 1000, num_examples: 512000, 14563.0 samples/sec, epoch time so far: 0:00:35.157647

[05/10/2019 02:29:35 INFO 140711520630592] Epoch: 6, batches: 1100, num_examples: 563200, 14496.9 samples/sec, epoch time so far: 0:00:38.849590
[05/10/2019 02:29:35 INFO 140711520630592]     Training metrics: perplexity: 1.115 cross_entropy: 0.109 accuracy: 0.961
[05/10/2019 02:29:39 INFO 140711520630592] Epoch: 6, batches: 1200, num_examples: 614400, 14498.6 samples/sec, epoch time so far: 0:00:42.376367
[05/10/2019 02:29:39 INFO 140711520630592]     Training metrics: perplexity: 1.115 cross_entropy: 0.109 accuracy: 0.961
[05/10/2019 02:29:42 INFO 140711520630592] Epoch: 6, batches: 1300, num_examples: 665600, 14497.5 samples/sec, epoch time so far: 0:00:45.911430
[05/10/2019 02:29:42 INFO 140711520630592]     Training metrics: perplexity: 1.116 cross_entropy: 0.110 accuracy: 0.961
[05/10/2019 02:29:46 INFO 140711520630592] Epoch: 6, batches: 1400, num_examples: 716800, 14501.7 samples/sec, epoch time so far: 0:00:49.42862

[05/10/2019 02:30:59 INFO 140711520630592] Epoch: 7, batches: 1700, num_examples: 870400, 14578.0 samples/sec, epoch time so far: 0:00:59.706406
[05/10/2019 02:30:59 INFO 140711520630592]     Training metrics: perplexity: 1.110 cross_entropy: 0.105 accuracy: 0.963
[05/10/2019 02:31:01 INFO 140711520630592] **************
[05/10/2019 02:31:01 INFO 140711520630592] Completed Epoch: 7, time taken: 0:01:01.544594
[05/10/2019 02:31:01 INFO 140711520630592] Epoch 7 Training metrics:   perplexity: 1.110 cross_entropy: 0.105 accuracy: 0.963
[05/10/2019 02:31:01 INFO 140711520630592] #quality_metric: host=algo-1, epoch=7, train cross_entropy <loss>=0.104857253555
[05/10/2019 02:31:01 INFO 140711520630592] #quality_metric: host=algo-1, epoch=7, train accuracy <score>=0.963013252711
[05/10/2019 02:31:01 INFO 140711520630592] **************
[05/10/2019 02:31:01 INFO 140711520630592] patience losses: [0.1186596917320196,

In [41]:
# deploy model

doc2vec_model = doc2vec.create_model(
                        serializer=json_serializer,
                        deserializer=json_deserializer,
                        content_type='application/json')

predictor = doc2vec_model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

---------------------------------------------------------------------------------------------------!

In [51]:
# Pre-tokenized (space-separated) Japanese input; in English: "Today's lunch was udon."
sent = '今日 の 昼食 は うどん だっ た'
sent_tokens = tokenizer.texts_to_sequences([sent])
payload = {'instances': [{'in0': sent_tokens[0]}]}
result = predictor.predict(payload)
print(result)

{'predictions': [{'embeddings': [-0.377080857753754, -0.130298480391502, -0.00400310754776, -0.234255239367485, 0.019936313852668, 0.091438762843609, -0.028139429166913, 0.097819216549397, -0.234084948897362, 0.019365105777979, 0.243991538882256, -0.031482722610235, -0.074961595237255, 0.005461027380079, 0.009248143993318, -0.241194263100624, -0.258727222681046, -0.084101520478725, 0.019796488806605, -0.020008843392134, 0.06386973708868, 0.199357435107231, 0.161249384284019, -0.24904477596283, 0.118158802390099, 0.022969206795096, -0.159567832946777, 0.360994398593903, -0.56401914358139, 0.184895858168602, -0.021323382854462, 0.428308039903641, -0.141494899988174, 0.007890330627561, 0.356158673763275, 0.055441379547119, 0.126019239425659, -0.170658618211746, -0.029620936140418, 0.064398549497128, 0.116877898573875, -0.190469399094582, -0.059230502694845, 0.216100350022316, -0.021844832226634, 0.278941422700882, 0.026624957099557, 0.287674874067306, -0.404219001531601, -0.34963721036911
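Once embeddings come back in the shape shown above, they can be compared directly, for example to rank candidate documents against a query. The sketch below works on a mocked response so that it runs without an endpoint; `build_payload`, `extract_embeddings`, `cosine`, and `rank_by_similarity` are hypothetical helpers, and cosine similarity is one reasonable choice of metric rather than something prescribed by Object2Vec.

```python
import math


def build_payload(token_id_lists, field="in0"):
    """Wrap several token-ID sequences into one request body."""
    return {"instances": [{field: ids} for ids in token_id_lists]}


def extract_embeddings(response):
    """Pull the embedding vectors out of a response dictionary."""
    return [p["embeddings"] for p in response["predictions"]]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_by_similarity(query, candidates):
    """Return candidate indices sorted from most to least similar."""
    scores = [(cosine(query, c), i) for i, c in enumerate(candidates)]
    return [i for _, i in sorted(scores, reverse=True)]


# Mocked response with the same structure as the endpoint output (values invented).
mock_response = {"predictions": [
    {"embeddings": [1.0, 0.0]},
    {"embeddings": [0.9, 0.1]},
    {"embeddings": [-1.0, 0.2]},
]}
query, *candidates = extract_embeddings(mock_response)
ranking = rank_by_similarity(query, candidates)
print(build_payload([[3, 14, 15]]))
print(ranking)
```

With a live endpoint, the dictionary returned by `predictor.predict(build_payload(...))` would take the place of `mock_response`.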

In [None]:
predictor.delete_endpoint()