# Document Embedding with Amazon SageMaker Object2Vec

1. [Introduction](#Introduction)
2. [Background](#Background)
  1. [Embedding documents using Object2Vec](#Embedding-documents-using-Object2Vec)
3. [Download and preprocess Wikipedia data](#Download-and-preprocess-Wikipedia-data)
  1. [Install and load dependencies](#Install-and-load-dependencies)
  2. [Build vocabulary and tokenize datasets](#Build-vocabulary-and-tokenize-datasets)
  3. [Upload preprocessed data to S3](#Upload-preprocessed-data-to-S3)
4. [Define SageMaker session, Object2Vec image, S3 input and output paths](#Define-SageMaker-session,-Object2Vec-image,-S3-input-and-output-paths)
5. [Train and deploy doc2vec](#Train-and-deploy-doc2vec)
  1. [Learning performance boost with new features](#Learning-performance-boost-with-new-features)
  2. [Training speedup with sparse gradient update](#Training-speedup-with-sparse-gradient-update)
6. [Apply learned embeddings to document retrieval task](#Apply-learned-embeddings-to-document-retrieval-task)
  1. [Comparison with the StarSpace algorithm](#Comparison-with-the-StarSpace-algorithm)

## Introduction

In this notebook, we introduce four new features to Object2Vec, a general-purpose neural embedding algorithm: negative sampling, sparse gradient update, weight-sharing, and comparator operator customization. The new features together broaden the applicability of Object2Vec, improve its training speed and accuracy, and provide users with greater flexibility. See [Introduction to the Amazon SageMaker Object2Vec](https://aws.amazon.com/blogs/machine-learning/introduction-to-amazon-sagemaker-object2vec/) if you aren’t already familiar with Object2Vec.

We demonstrate how these new features extend the applicability of Object2Vec to a new document embedding use-case: a customer has a large collection of documents. Instead of storing these documents in their raw format or as sparse bag-of-words vectors, she would like to embed all documents in a common low-dimensional space so that the semantic distances between them are preserved, which makes training in the various downstream tasks more efficient.

## Background

Object2Vec is a highly customizable multi-purpose algorithm that can learn embeddings of pairs of objects. The embeddings are learned such that they preserve the objects' pairwise similarities in the original space.

- Similarity is user-defined: users need to provide the algorithm with pairs of objects that they define as similar (1) or dissimilar (0); alternatively, the users can define similarity in a continuous sense (provide a real-valued similarity score).

- The learned embeddings can be used to efficiently compute nearest neighbors of objects, as well as to visualize natural clusters of related objects in the embedding space. In addition, the embeddings can also be used as features of the corresponding objects in downstream supervised tasks such as classification or regression.
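
As a rough illustration of the nearest-neighbor use mentioned above, the following sketch ranks a few toy vectors by cosine similarity to a query vector. The function names and the three-dimensional vectors are hypothetical and chosen only for illustration; real Object2Vec embeddings have `enc_dim` dimensions and would come from a trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def nearest_neighbors(query, embeddings, k=2):
    """Return the k keys of `embeddings` whose vectors are most similar to `query`."""
    ranked = sorted(
        embeddings,
        key=lambda name: cosine_similarity(query, embeddings[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy embeddings, for illustration only
toy_embeddings = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
print(nearest_neighbors([1.0, 0.05, 0.0], toy_embeddings, k=2))
```

For large collections, an exhaustive scan like this would normally be replaced by an approximate nearest-neighbor index, but the ranking criterion stays the same.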

### Embedding documents using Object2Vec

We demonstrate how, with the new features, Object2Vec can be used to embed a large collection of documents into vectors in the same latent space.

Similar to the widely used Word2Vec algorithm for word embedding, a natural approach to document embedding is to preprocess documents as (sentence, context) pairs, where the sentence and its matching context come from the same document. The matching context is the entire document with the given sentence removed. The idea is to embed both sentence and context into a low-dimensional space such that their mutual similarity is maximized, since they belong to the same document and therefore should be semantically related. The learned encoder for the context can then be used to encode new documents into the same embedding space.

To train the encoders for sentences and documents, we also need negative (sentence, context) pairs so that the model can learn to discriminate between semantically similar and dissimilar pairs. It is easy to generate such negatives by pairing sentences with documents that they do not belong to. Since there are many more negative pairs than positives in naturally occurring data, we typically resort to random sampling techniques to achieve a balance between positive and negative pairs in the training data. The figure below shows pictorially how the positive pairs and negative pairs are generated from unlabeled data for the purpose of learning embeddings for documents (and sentences).
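
The pair construction described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the preprocessing used later in this notebook: the function name `make_pairs`, the `negatives_per_positive` parameter, and the dictionary keys are hypothetical, and documents are assumed to be lists of sentence strings.

```python
import random


def make_pairs(documents, negatives_per_positive=1, seed=0):
    """Build labeled (sentence, context) pairs from a list of documents.

    Each document is a list of sentences. For every document, one sentence
    is held out and paired with the rest of its own document (label 1),
    then paired with randomly chosen other documents (label 0).
    """
    rng = random.Random(seed)
    pairs = []
    for doc_id, sentences in enumerate(documents):
        if len(sentences) < 2:
            continue  # need at least one sentence left over as context
        idx = rng.randrange(len(sentences))
        sentence = sentences[idx]
        context = sentences[:idx] + sentences[idx + 1:]
        pairs.append({"sentence": sentence, "context": context, "label": 1})

        # Negatives: the same sentence paired with a document it does not belong to
        other_ids = [i for i in range(len(documents)) if i != doc_id]
        if not other_ids:
            continue
        for _ in range(negatives_per_positive):
            neg_id = rng.choice(other_ids)
            pairs.append(
                {"sentence": sentence, "context": list(documents[neg_id]), "label": 0}
            )
    return pairs


# Hypothetical toy corpus, for illustration only
toy_documents = [["a1", "a2", "a3"], ["b1", "b2"], ["c1", "c2", "c3"]]
for pair in make_pairs(toy_documents, negatives_per_positive=2):
    print(pair)
```

Sampling a fixed number of negatives per positive is one simple way to keep the two classes balanced; with the negative sampling feature introduced in this notebook, this step is handled inside the algorithm instead.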

We show how Object2Vec with the new *negative sampling feature* can be applied to the document embedding use-case. In addition, we show how the other new features, namely *weight-sharing*, *customization of comparator operator*, and *sparse gradient update*, together enhance the algorithm's performance and user experience in and beyond this use-case. Sections [Learning performance boost with new features](#Learning-performance-boost-with-new-features) and [Training speedup with sparse gradient update](#Training-speedup-with-sparse-gradient-update) in this notebook provide a detailed introduction to the new features.

## Download and preprocess Wikipedia data

Please be aware of the following requirements about the acknowledgment, copyright and availability, cited from the [data source description page](https://github.com/facebookresearch/StarSpace/blob/master/LICENSE.md).

> Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

In [None]:
%%bash

DATANAME="wikipedia"
DATADIR="/tmp/wiki"

mkdir -p "${DATADIR}"

# Skip the download when the extracted file (read later in this notebook) already exists
if [ ! -f "${DATADIR}/ja.wikipedia_250k.txt" ]
then
    echo "Downloading wikipedia data"
    wget --quiet -c "https://s3-ap-northeast-1.amazonaws.com/dev.tech-sketch.jp/chakki/public/ja.wikipedia_250k.zip" -O "${DATADIR}/${DATANAME}_train.zip"
    unzip "${DATADIR}/${DATANAME}_train.zip" -d "${DATADIR}"
fi


In [None]:
datadir = '/tmp/wiki'

In [None]:
!ls /tmp/wiki

### Install and load dependencies

In [None]:
!pip install keras tensorflow

In [None]:
import json
import os
import random
from itertools import chain
from keras.preprocessing.text import Tokenizer
from sklearn.preprocessing import normalize

## sagemaker api
import sagemaker, boto3
from sagemaker.session import s3_input
from sagemaker.predictor import json_serializer, json_deserializer

### Build vocabulary and tokenize datasets

In [None]:
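def load_articles(filepath):
    # Each line is one article: sentences are tab-separated, tokens are whitespace-separated
    with open(filepath, encoding='utf-8') as f:
        for line in f:
            # Yield the article as a list of sentences, each a list of tokens
            yield [sent.split() for sent in line.strip().split('\t')]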


def split_sents(article):
    return [sent.split(' ') for sent in article.split('\t')]


def build_vocab(sents):
    print('Build start...')
    tok = Tokenizer(oov_token='<UNK>', filters='')
    tok.fit_on_texts(sents)
    print('Build end...')
    return tok


def generate_positive_pairs_from_single_article(sents, tokenizer):
    # Hold out one random sentence as 'in0'; the rest of the article is the context 'in1'
    sents = list(sents)
    idx = random.randrange(0, len(sents))
    center = sents.pop(idx)
    # Map tokens to integer ids
    wrapper_tokens = tokenizer.texts_to_sequences(sents)
    sent_tokens = tokenizer.texts_to_sequences([center])
    # Flatten the per-sentence id lists into a single id sequence
    wrapper_tokens = list(chain(*wrapper_tokens))
    sent_tokens = list(chain(*sent_tokens))
    yield {'in0': sent_tokens, 'in1': wrapper_tokens, 'label': 1}


def generate_positive_pairs_from_single_file(sents_per_article, tokenizer):
    iter_list = [generate_positive_pairs_from_single_article(sents, tokenizer)
                 for sents in sents_per_article
                 ]
    return chain.from_iterable(iter_list)


In [None]:
filepath = os.path.join(datadir, 'ja.wikipedia_250k.txt')

# First pass over the file: build the vocabulary from all sentences
sents_per_article = load_articles(filepath)
sents = chain.from_iterable(sents_per_article)
tokenizer = build_vocab(sents)

# Second pass: write one positive (sentence, context) pair per article as JSON Lines
datadir = '.'
train_prefix = 'train250k'
outfname = os.path.join(datadir, '{}_tokenized.jsonl'.format(train_prefix))
with open(outfname, 'w') as f:
    sents_per_article = load_articles(filepath)
    for sample in generate_positive_pairs_from_single_file(sents_per_article, tokenizer):
        f.write('{}\n'.format(json.dumps(sample)))

In [None]:
# Shuffle training data
!shuf {outfname} > {train_prefix}_tokenized_shuf.jsonl

### Upload preprocessed data to S3

In [None]:
TRAIN_DATA="train250k_tokenized_shuf.jsonl"

# NOTE: define your s3 bucket and key here
S3_BUCKET = 'YOUR_BUCKET'
S3_KEY = 'object2vec-doc2vec'



In [None]:
%%bash -s "$TRAIN_DATA" "$S3_BUCKET" "$S3_KEY"

aws s3 cp "$1" s3://$2/$3/input/train/

## Define SageMaker session, Object2Vec image, S3 input and output paths

In [None]:
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri


region = boto3.Session().region_name
print("Your notebook is running on region '{}'".format(region))

sess = sagemaker.Session()

 
role = get_execution_role()
print("Your IAM role: '{}'".format(role))

container = get_image_uri(region, 'object2vec')
print("The image uri used is '{}'".format(container))

print("Using s3 bucket: {} and key prefix: {}".format(S3_BUCKET, S3_KEY))

In [None]:
## define input channels

s3_input_path = os.path.join('s3://', S3_BUCKET, S3_KEY, 'input')

s3_train = s3_input(os.path.join(s3_input_path, 'train', TRAIN_DATA), 
                    distribution='ShardedByS3Key', content_type='application/jsonlines')

In [None]:
## define output path
output_path = os.path.join('s3://', S3_BUCKET, S3_KEY, 'models')

## Train and deploy doc2vec

We combine four new features into our training of Object2Vec:

- Negative sampling: With the new `negative_sampling_rate` hyperparameter, users of Object2Vec only need to provide positively labeled data pairs, and the algorithm automatically samples negative pairs internally during training.

- Weight-sharing of embedding layer: The new `tied_token_embedding_weight` hyperparameter gives users the flexibility to share the embedding weights between both encoders, which improves the performance of the algorithm in this use-case.

- Customization of comparator operator: The new `comparator_list` hyperparameter gives users the flexibility to mix and match different operators so that they can tune the algorithm towards optimal performance for their applications.

- Sparse gradient update: The new `token_embedding_storage_type` hyperparameter, set to `row_sparse` in the cell below, enables sparse gradient updates for the token embedding layer to speed up training.
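
To make the comparator operators more concrete, here is a small sketch of what mixing operators could look like on two encoder outputs. It assumes the operator names `hadamard`, `concat`, and `abs_diff`, and assumes that the results of the listed operators are concatenated in order into a single vector; the `compare` helper is hypothetical and only mimics the idea, it is not the algorithm's internal code.

```python
def hadamard(u, v):
    """Element-wise product of the two encodings."""
    return [a * b for a, b in zip(u, v)]


def concat(u, v):
    """Concatenation of the two encodings."""
    return list(u) + list(v)


def abs_diff(u, v):
    """Element-wise absolute difference of the two encodings."""
    return [abs(a - b) for a, b in zip(u, v)]


# Operator names assumed to match the values accepted by `comparator_list`
OPERATORS = {"hadamard": hadamard, "concat": concat, "abs_diff": abs_diff}


def compare(u, v, comparator_list="hadamard"):
    """Apply each listed operator and concatenate the results in order."""
    output = []
    for name in comparator_list.split(","):
        name = name.strip()
        if name not in OPERATORS:
            raise ValueError("Unknown comparator operator: {}".format(name))
        output.extend(OPERATORS[name](u, v))
    return output


# Hypothetical two-dimensional encodings, for illustration only
enc0_output = [1.0, 2.0]
enc1_output = [3.0, 5.0]
print(compare(enc0_output, enc1_output, "hadamard"))
print(compare(enc0_output, enc1_output, "hadamard, abs_diff"))
```

Under these assumptions, choosing a single operator such as `hadamard` keeps the comparator output as small as one encoding, while adding more operators gives the downstream layers more signals to work with at the cost of a wider input.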

In [None]:
# Define training hyperparameters

hyperparameters = {
      "_kvstore": "device",
      "_num_gpus": 'auto',
      "_num_kv_servers": "auto",
      "bucket_width": 0,
      "dropout": 0.4,
      "early_stopping_patience": 2,
      "early_stopping_tolerance": 0.01,
      "enc0_layers": "auto",
      "enc0_max_seq_len": 50,
      "enc0_network": "pooled_embedding",
      "enc0_pretrained_embedding_file": "",
      "enc0_token_embedding_dim": 300,
      "enc0_vocab_size": len(tokenizer.word_index) + 1,
      "enc1_network": "enc0",
      "enc_dim": 300,
      "epochs": 20,
      "learning_rate": 0.01,
      "mini_batch_size": 512,
      "mlp_activation": "relu",
      "mlp_dim": 512,
      "mlp_layers": 2,
      "num_classes": 2,
      "optimizer": "adam",
      "output_layer": "softmax",
      "weight_decay": 0
}


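# Enable the four new features
hyperparameters['negative_sampling_rate'] = 3  # negative pairs sampled per positive pair
hyperparameters['tied_token_embedding_weight'] = "true"  # share token embeddings between both encoders
hyperparameters['comparator_list'] = "hadamard"  # operator(s) used to compare the two encodings
hyperparameters['token_embedding_storage_type'] = 'row_sparse'  # sparse gradient update for the embedding layer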

    
# get estimator
doc2vec = sagemaker.estimator.Estimator(container,
                                          role, 
                                          train_instance_count=1, 
                                          train_instance_type='ml.p2.xlarge',
                                          output_path=output_path,
                                          sagemaker_session=sess)



In [None]:
# set hyperparameters
doc2vec.set_hyperparameters(**hyperparameters)

# fit estimator with data
doc2vec.fit({'train': s3_train})
#doc2vec.fit({'train': s3_train, 'validation':s3_valid, 'test':s3_test})

In [None]:
# deploy model

doc2vec_model = doc2vec.create_model(
                        serializer=json_serializer,
                        deserializer=json_deserializer,
                        content_type='application/json')

predictor = doc2vec_model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

In [None]:
# Example input: a whitespace-tokenized Japanese sentence ("Today's lunch was udon")
sent = '今日 の 昼食 は うどん だっ た'
sent_tokens = tokenizer.texts_to_sequences([sent])
payload = {'instances': [{'in0': sent_tokens[0]}]}
result = predictor.predict(payload)
print(result)
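
Once a response has been received, the embedding vectors can be pulled out and normalized before computing similarities. The sketch below assumes the response is a dictionary shaped like `{"predictions": [{"embeddings": [...]}]}`; please check the printed `result` above to confirm the actual layout. The helper names and the example response are hypothetical.

```python
import math


def extract_embeddings(response):
    """Collect embedding vectors from a response assumed to look like
    {"predictions": [{"embeddings": [...]}, ...]}."""
    return [prediction["embeddings"] for prediction in response["predictions"]]


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


# Hypothetical response, for illustration only
example_response = {"predictions": [{"embeddings": [3.0, 4.0]}]}
unit_embeddings = [l2_normalize(e) for e in extract_embeddings(example_response)]
print(unit_embeddings)
```

With unit-length vectors, the dot product of two embeddings equals their cosine similarity, which is convenient for retrieval.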

In [None]:
predictor.delete_endpoint()