### Introduction

This notebook consists of a few different sections, each supporting a different step in the creation of a serverless wine recommender API:

1. Preprocess wine review dataset
2. Train wine word embeddings using BlazingText model
3. Generate lookup table with wine word embeddings from trained BlazingText model
4. Convert wine reviews to wine embeddings
5. Train Nearest Neighbors model on wine embeddings

First, we need to install and import the necessary libraries. The gensim library has to be installed with pip before it can be used from within SageMaker.

In [1]:
!pip install gensim

Collecting gensim
  Downloading https://files.pythonhosted.org/packages/d3/4b/19eecdf07d614665fa889857dc56ac965631c7bd816c3476d2f0cac6ea3b/gensim-3.7.3-cp36-cp36m-manylinux1_x86_64.whl (24.2MB)
Collecting smart-open>=1.7.0 (from gensim)
  Downloading https://files.pythonhosted.org/packages/37/c0/25d19badc495428dec6a4bf7782de617ee0246a9211af75b302a2681dea7/smart_open-1.8.4.tar.gz (63kB)
Building wheels for collected packages: smart-open
  Running setup.py bdist_wheel for smart-open ... done
  Stored in directory: /home/ec2-user/.cache/pip/wheels/5f/ea/fb/5b1a947b369724063b2617011f1540c44eb00e28c3d2ca8692
Successfully built smart-open
Installing collected packages: smart-open, gensim
Successfully installed gensim-3.7.3 smart-open-1.8.4
You are using pip version 10.0.1, however version 19.1.1 is available.
You

In [2]:
import boto3
import os
import sagemaker
from sagemaker import get_execution_role
import pandas as pd

import numpy as np
import string
from operator import itemgetter
from collections import Counter, OrderedDict

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords
nltk.download('stopwords')

from gensim.models.phrases import Phrases, Phraser
from gensim.models import Word2Vec

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors
from sklearn.decomposition import PCA
from matplotlib import pyplot as plt

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


### 1. Preprocess the wine review dataset

To preprocess the text in the wine reviews, we first need to load the full dataset from the S3 bucket in which it has been stored.

In [3]:
role = get_execution_role()
bucket='data-science-wine-reviews'
data_key = 'full_wine_dataset.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)

wine_dataset = pd.read_csv(data_location, low_memory=False)
wine_dataset_relevant = wine_dataset[['Name', 'Description']]
wine_dataset_relevant.head(10)

Unnamed: 0,Name,Description
0,Nicosia 2013 Vulkà Bianco (Etna),"Aromas include tropical fruit, broom, brimston..."
1,Quinta dos Avidagos 2011 Avidagos Red (Douro),"This is ripe and fruity, a wine that is smooth..."
2,Rainstorm 2013 Pinot Gris (Willamette Valley),"Tart and snappy, the flavors of lime flesh and..."
3,St. Julian 2013 Reserve Late Harvest Riesling ...,"Pineapple rind, lemon pith and orange blossom ..."
4,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,"Much like the regular bottling from 2012, this..."
5,Tandem 2011 Ars In Vitro Tempranillo-Merlot (N...,Blackberry and raspberry aromas show a typical...
6,Terre di Giurfo 2013 Belsito Frappato (Vittoria),"Here's a bright, informal red that opens with ..."
7,Trimbach 2012 Gewurztraminer (Alsace),This dry and restrained wine offers spice in p...
8,Heinz Eifel 2013 Shine Gewürztraminer (Rheinhe...,Savory dried thyme notes accent sunnier flavor...
9,Jean-Baptiste Adam 2012 Les Natures Pinot Gris...,This has great depth of flavor with its fresh ...


To build vector representations of our wine reviews, we first need to train word embeddings on the various wine terms in our corpus. This involves several preprocessing steps:

- concatenate all the wine reviews in the corpus
- tokenize into sentences
- remove stopwords
- convert to lower case
- remove special characters
- stemming

In [4]:
reviews_list = list(wine_dataset_relevant['Description'])
reviews_list = [str(r) for r in reviews_list]
full_corpus = ' '.join(reviews_list)
sentences_tokenized = sent_tokenize(full_corpus)

stop_words = set(stopwords.words('english')) 

punctuation_table = str.maketrans({key: None for key in string.punctuation})
sno = SnowballStemmer('english')

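def normalize_text(raw_text):
    """Tokenize, lower-case, stem, strip punctuation and drop stop words."""
    try:
        word_list = word_tokenize(raw_text)
    except Exception:
        # Return an empty token list if the text cannot be tokenized.
        return []
    normalized_sentence = []
    for w in word_list:
        try:
            lower_case_word = str(w).lower()
            stemmed_word = sno.stem(lower_case_word)
            no_punctuation = stemmed_word.translate(punctuation_table)
            # Keep tokens longer than one character that are not stop words.
            if len(no_punctuation) > 1 and no_punctuation not in stop_words:
                normalized_sentence.append(no_punctuation)
        except Exception:
            # Skip any token that fails to normalize.
            continue
    return normalized_sentence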

In [5]:
# sentence_sample = sentences_tokenized[:10]
normalized_sentences = []
for s in sentences_tokenized:
    normalized_text = normalize_text(s)
    normalized_sentences.append(normalized_text)
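
To make the order of the normalization steps concrete, here is a minimal standard-library sketch applied to a single sentence. It is an illustration only: `TOY_STOP_WORDS` and `toy_normalize` are hypothetical stand-ins, whitespace splitting replaces the NLTK tokenizer, and stemming is left out entirely.

```python
import string

# Hypothetical stand-in for the NLTK English stop word list.
TOY_STOP_WORDS = {"the", "and", "of", "a", "is", "this", "with"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def toy_normalize(sentence):
    """Lower-case, strip punctuation and drop stop words (no stemming)."""
    tokens = []
    for raw in sentence.split():
        word = raw.lower().translate(PUNCT_TABLE)
        # Same filter as in the notebook: longer than one character, not a stop word.
        if len(word) > 1 and word not in TOY_STOP_WORDS:
            tokens.append(word)
    return tokens

toy_tokens = toy_normalize("This wine is ripe and fruity, with a smooth finish.")
print(toy_tokens)
```

In the notebook itself the Snowball stemmer runs before the stop word check, so the tokens that reach the corpus are stems rather than full words.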

Next, we use the gensim Phrases package to extract bi- and tri-grams from the corpus.

In [6]:
phrases = Phrases(normalized_sentences)
phrases = Phrases(phrases[normalized_sentences])

ngrams = Phraser(phrases)

phrased_sentences = []
for sent in normalized_sentences:
    phrased_sentence = ngrams[sent]
    phrased_sentences.append(phrased_sentence)

full_list_words = [item for sublist in phrased_sentences for item in sublist]
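
For intuition on how a phrase detector decides which adjacent words to join, the sketch below scores bigrams with the collocation formula that gensim documents as its default scorer, as far as I understand it: `(count(a, b) - min_count) / (count(a) * count(b)) * vocabulary size`. The function `score_bigrams` and the toy sentences are hypothetical, and `min_count=1` is chosen only so that the tiny example produces non-zero scores.

```python
from collections import Counter

def score_bigrams(sentences, min_count=1):
    """Score every adjacent word pair with a collocation score."""
    unigrams = Counter()
    bigrams = Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    vocab_size = len(unigrams)
    scores = {}
    for (a, b), n_ab in bigrams.items():
        scores[(a, b)] = (n_ab - min_count) / (unigrams[a] * unigrams[b]) * vocab_size
    return scores

# Hypothetical, already-normalized sentences.
toy_sentences = [
    ["black", "cherri", "aroma"],
    ["black", "cherri", "flavor"],
    ["ripe", "black", "cherri"],
    ["ripe", "fruit", "flavor"],
]
scores = score_bigrams(toy_sentences)
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 3))
```

Pairs whose score exceeds a threshold are merged into a single token such as `black_cherri`. Running the detector a second time over the merged output, as the cell above does, is what allows tri-grams to form.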

In a previous chapter of this work, we mapped commonly appearing and semantically meaningful words, bi-grams and tri-grams from wine reviews to a standardized set of wine descriptors. We will now apply this mapping to the corpus.

In [7]:
descriptor_mapping = pd.read_csv('s3://{}/descriptor_mapping.csv'.format(bucket)).set_index('raw descriptor')

sess = sagemaker.Session()

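# Build the lookup once; rebuilding a list from the index on every call is slow.
descriptor_lookup = descriptor_mapping['level_3'].to_dict()

def return_mapped_descriptor(word):
    # Return the standardized (level_3) descriptor, or the word itself if it is unmapped.
    return descriptor_lookup.get(word, word)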

normalized_sentences = []
for sent in phrased_sentences:
    normalized_sentence = []
    for word in sent:
        normalized_word = return_mapped_descriptor(word)
        normalized_sentence.append(str(normalized_word))
    normalized_sentence.append('.')
    normalized_sentence_concat = ' '.join(normalized_sentence)
    normalized_sentences.append(normalized_sentence_concat)
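
As a small illustration of what the mapping step produces for one sentence, the sketch below uses a plain dictionary. The entries in `toy_mapping` are hypothetical and are not taken from `descriptor_mapping.csv`.

```python
# Hypothetical raw-descriptor -> standardized-descriptor entries.
toy_mapping = {"black_cherri": "cherry", "oaki": "oak"}

def map_sentence(tokens, mapping):
    """Replace mapped tokens, keep the rest, and end the sentence with ' .'."""
    mapped = [str(mapping.get(token, token)) for token in tokens]
    mapped.append(".")
    return " ".join(mapped)

toy_line = map_sentence(["ripe", "black_cherri", "oaki", "finish"], toy_mapping)
print(toy_line)
```

Each sentence becomes one space-separated line ending in a period, which is the form written to the training file in the next step.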

To train the BlazingText algorithm in SageMaker, the training data needs to be stored in a .txt file. We will write our normalized corpus to a .txt file and save this in our S3 bucket.

In [8]:
with open('wine_corpus.txt', 'w') as f:
    for item in normalized_sentences:
        f.write("{}\n".format(item))

boto3.Session().resource('s3').Bucket(bucket).Object('wine-corpus.txt').upload_file('wine_corpus.txt')

### 2. Train wine word embeddings using BlazingText model

Now that the training data has been prepared, we can turn our attention to training the BlazingText model. We need to define a location for the training data and an output location for the model. We also need to define a container for the BlazingText algorithm.

In [9]:
train_data = 's3://{}/wine-corpus.txt'.format(bucket)
s3_output_location = 's3://{}/output'.format(bucket)

region_name = boto3.Session().region_name
container = sagemaker.amazon.amazon_estimator.get_image_uri(region_name, "blazingtext", "latest")
print('Using SageMaker BlazingText container: {} ({})'.format(container, region_name))

Using SageMaker BlazingText container: 825641698319.dkr.ecr.us-east-2.amazonaws.com/blazingtext:latest (us-east-2)


We also need to set the specifications of the instance that we will use to train the BlazingText model, and choose the hyperparameters of the model.

In [10]:
sess = sagemaker.Session()

bt_model = sagemaker.estimator.Estimator(container,
                                         role, 
                                         train_instance_count=2, 
                                         train_instance_type='ml.c4.2xlarge',
                                         train_volume_size = 5,
                                         train_max_run = 360000,
                                         input_mode= 'File',
                                         output_path=s3_output_location,
                                         sagemaker_session=sess)

bt_model.set_hyperparameters(mode="batch_skipgram",
                             epochs=15,
                             min_count=5,
                             sampling_threshold=0.0001,
                             learning_rate=0.05,
                             window_size=5,
                             vector_dim=300,
                             negative_samples=5,
                             batch_size=11, #  = (2*window_size + 1) (Preferred. Used only if mode is batch_skipgram)
                             evaluation=True,# Perform similarity evaluation on WS-353 dataset at the end of training
                             subwords=False) # Subword embedding learning is not supported by batch_skipgram
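
To clarify what `window_size` controls, here is a sketch of how skip-gram training pairs are generally formed: each word is paired with its neighbors up to `window_size` positions away. This shows the general skip-gram idea only, not BlazingText's internal batching, and `skipgram_pairs` is a hypothetical helper.

```python
def skipgram_pairs(tokens, window_size=2):
    """Return (target, context) pairs for every token within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window_size)
        stop = min(len(tokens), i + window_size + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

toy_pairs = skipgram_pairs(["ripe", "black_cherri", "oak", "finish"], window_size=1)
print(toy_pairs)
```

With `window_size=5`, as configured above, a full window spans 2 * 5 + 1 = 11 tokens, which matches the `batch_size=11` setting noted in the code comment.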

To feed the training data to the model, we need to set a channel for it to access.

In [11]:
train_data = sagemaker.session.s3_input(train_data, distribution='FullyReplicated', 
                        content_type='text/plain', s3_data_type='S3Prefix')
data_channels = {'train': train_data}

Time to fit the model!

In [15]:
bt_model.fit(inputs=data_channels, logs=True)

2019-07-05 16:25:02 Starting - Starting the training job...
2019-07-05 16:25:03 Starting - Launching requested ML instances......
2019-07-05 16:26:05 Starting - Preparing the instances for training...
2019-07-05 16:26:59 Downloading - Downloading input data
2019-07-05 16:26:59 Training - Downloading the training image..
Arguments: train
Found 10.0.241.111 for host algo-1
Found 10.0.202.136 for host algo-2
Arguments: train
Found 10.0.241.111 for host algo-1
Found 10.0.202.136 for host algo-2

[07/05/2019 16:27:24 INFO 140673357956928] nvidia-smi took: 0.0251688957214 secs to identify 0 gpus
[07/05/2019 16:27:24 INFO 140673357956928] Running distributed CPU BlazingText training using batch_skipgram on 2 hosts.
[07/05/2019 16:27:24 INFO 140673357956928] Number of hosts: 2, master IP address: 10.0.241.111, host IP address: 10.0.202.136.
[07/05/2019 16:27:24 INFO 139910265243456] nvidia-smi took: 0.0252530

We could now configure an endpoint to host the model. The call is left commented out because, as explained in the next section, we do not need one.

In [16]:
# bt_endpoint = bt_model.deploy(initial_instance_count = 1,instance_type = 'ml.t2.medium')

### 3. Generate lookup table with wine word embeddings from trained BlazingText model

We do not need to host an endpoint for our model. Rather, we want to use our model to produce a lookup table for all the wine-related terms in our corpus. We can then use a Lambda function to retrieve information from this lookup table. This will be cheaper and more efficient than permanently hosting an endpoint for our BlazingText model.

To create this lookup table, we will first download the model tarfile from the S3 bucket it was saved to.

In [17]:
s3 = boto3.resource('s3')
key = bt_model.model_data[bt_model.model_data.find("/", 5)+1:]
s3.Bucket(bucket).download_file(key, 'model.tar.gz')
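
The slicing expression above extracts the object key by skipping the `s3://` prefix and the bucket name. The sketch below shows the same split on a hypothetical URI, next to an equivalent version based on `urllib.parse`. The URI and the helper name `split_s3_uri` are made up for illustration.

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    return parsed.netloc, parsed.path.lstrip("/")

# Hypothetical model artifact location.
example_uri = "s3://example-bucket/output/example-job/output/model.tar.gz"

bucket_name, object_key = split_s3_uri(example_uri)

# Same result with the slicing approach used in the notebook:
# find the first "/" after the "s3://" prefix and keep everything after it.
slice_key = example_uri[example_uri.find("/", 5) + 1:]
print(bucket_name, object_key)
```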

We can then open the tarfile - we see that it consists of three items. We are most interested in the vectors.txt file, which is a text file with our trained word embeddings.

In [18]:
!tar -xvzf model.tar.gz

vectors.bin
vectors.txt
eval.json


Now, we can open the vectors.txt file. We will only keep those descriptors that are in our 'descriptor mapping' of common and meaningful wine descriptors.

We will save the resulting csv file in our S3 bucket.

In [19]:
from sklearn.preprocessing import normalize
num_points = len(open('vectors.txt','r').read().split('\n'))

first_line = True
index_to_word = []
with open("vectors.txt","r") as f:
    for line_num, line in enumerate(f):
        if first_line:
            dim = int(line.strip().split()[1])
            word_vecs = np.zeros((num_points, dim), dtype=float)
            first_line = False
            continue
        line = line.strip()
        word = line.split()[0]
        vec = word_vecs[line_num-1]
        for index, vec_val in enumerate(line.split()[1:]):
            vec[index] = float(vec_val)
        index_to_word.append(word)
        if line_num >= num_points:
            break
word_vecs = normalize(word_vecs, copy=False, return_norm=False)

names_vecs = list(zip(index_to_word, word_vecs))

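# Build the set once so that each membership test is a constant-time lookup.
level_3_descriptors = set(descriptor_mapping['level_3'])
names_vecs_filtered = [n for n in names_vecs if n[0] in level_3_descriptors]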

names_vecs_df = pd.DataFrame(names_vecs_filtered, columns=['word', 'vector'])
names_vecs_df.sort_values(by=['word'], inplace=True)
names_vecs_df.to_csv('word_vectors.csv')
boto3.Session().resource('s3').Bucket(bucket).Object('word_vectors.csv').upload_file('word_vectors.csv')
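
The parsing loop above relies on the layout of `vectors.txt`: a header line whose second field is the vector dimension (which is how the cell reads it), followed by one line per word with the word and then its vector components. The sketch below reads that layout from an in-memory string and L2-normalizes each vector using only the standard library. The file contents are invented, and `read_word_vectors` is a hypothetical helper rather than part of any library.

```python
import io
import math

def read_word_vectors(handle):
    """Parse '<count> <dim>' header plus 'word v1 ... vdim' lines into unit vectors."""
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        parts = line.split()
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        values = [float(v) for v in parts[1:]]
        norm = math.sqrt(sum(v * v for v in values)) or 1.0  # avoid dividing by zero
        vectors[parts[0]] = [v / norm for v in values]
    return vectors

# Invented three-dimensional vectors for two words.
toy_file = io.StringIO("2 3\ncherri 3.0 0.0 4.0\noak 0.0 2.0 0.0\n")
toy_vectors = read_word_vectors(toy_file)
print(toy_vectors)
```

Normalizing every vector to unit length means that the dot product of two word vectors equals their cosine similarity.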

### 4. Convert wine reviews to wine embeddings

Now that we have our word embeddings, we can turn our attention to creating 'wine embeddings': a single vector representation of each wine review. We will go through a few steps to achieve this:

1. Retrieve descriptors from each wine review
2. Use mapping of wine descriptors to 'standardize' these terms
3. Retrieve the word vectors for these standardized wine descriptors
4. Weight each word vector in the wine review by a TF-IDF weighting
5. Average the weighted word vectors in each wine review to produce a single 'wine embedding'

First, let's retrieve the descriptors from each wine review:

In [20]:
wine_reviews = list(wine_dataset_relevant['Description'])

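# Build the lookup once rather than scanning the index on every call.
descriptor_lookup = descriptor_mapping['level_3'].to_dict()

def return_descriptor_from_mapping(word):
    # Returns None for words that are not in the descriptor mapping.
    return descriptor_lookup.get(word)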

descriptorized_reviews = []
for review in wine_reviews:
    normalized_review = normalize_text(review)
    phrased_review = ngrams[normalized_review]
    descriptors_only = [return_descriptor_from_mapping(word) for word in phrased_review]
    no_nones = [str(d) for d in descriptors_only if d is not None]
    descriptorized_review = ' '.join(no_nones)
    descriptorized_reviews.append(descriptorized_review)

Instead of having a separate file with the IDF scores for each word and a separate file with all the word vectors, we will create a single consolidated file with the IDF-weighted word vectors. This will be more efficient later on in our process.

In [21]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit(descriptorized_reviews)

dict_of_idf_weightings = pd.DataFrame(zip(X.get_feature_names(), X.idf_), columns=['word', 'idf'])

vectors_and_idf = pd.merge(left=names_vecs_df, right=dict_of_idf_weightings, left_on='word', right_on='word', how='inner')
vectors_and_idf['word_vec_idf'] = vectors_and_idf['vector']*vectors_and_idf['idf']
vectors_and_idf = vectors_and_idf[['word', 'word_vec_idf']]
vectors_and_idf.set_index('word', inplace=True)
vectors_and_idf.to_csv('word_vectors_idf.csv')
boto3.Session().resource('s3').Bucket(bucket).Object('word_vectors_idf.csv').upload_file('word_vectors_idf.csv')
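
For reference, with its default settings scikit-learn's `TfidfVectorizer` computes a smoothed IDF of `ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` the number of documents containing term `t`. The sketch below reproduces that formula in plain Python on three invented reviews. `smoothed_idf` is a hypothetical helper, and whitespace splitting stands in for the vectorizer's own tokenizer.

```python
import math
from collections import Counter

def smoothed_idf(documents):
    """Smoothed inverse document frequency for whitespace-separated documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}

# Invented descriptorized reviews.
toy_reviews = ["cherry oak", "cherry tart", "cherry lime tart"]
toy_idf = smoothed_idf(toy_reviews)
print(toy_idf)
```

A descriptor that appears in every review gets a weight of exactly 1, while rarer descriptors get larger weights, so they contribute more to a review's embedding.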

Now, we have all the necessary individual pieces to create a precalculated embedding for every wine review. This will be the input variable for our nearest neighbors recommender model.

In [22]:
wine_review_vectors = []
for d in descriptorized_reviews:
    descriptor_count = 0
    weighted_review_terms = []
    terms = d.split(' ')
    
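    for term in terms:
        # Test membership against the index directly instead of rebuilding a list for every term.
        if term in vectors_and_idf.index:
            weighted_word_vector = vectors_and_idf.at[term, 'word_vec_idf']
            weighted_review_terms.append(weighted_word_vector)
            descriptor_count += 1

    try:
        # Average the IDF-weighted word vectors of the review.
        review_vector = sum(weighted_review_terms)/len(weighted_review_terms)
    except ZeroDivisionError:
        # The review contains no known descriptors.
        review_vector = []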
    
    vector_and_count = [terms, review_vector, descriptor_count]
    wine_review_vectors.append(vector_and_count)

wine_review_vectors_df = pd.DataFrame(wine_review_vectors, columns=['descriptors', 'review_vector', 'descriptor_count'])
full_wine_df = pd.concat([wine_dataset_relevant, wine_review_vectors_df], axis=1)
full_wine_df.dropna(how='any', inplace=True)
full_wine_df.drop_duplicates(subset=['Name'], inplace=True)
full_wine_df.to_csv('wine_review_vectors.csv')
boto3.Session().resource('s3').Bucket(bucket).Object('wine_review_vectors.csv').upload_file('wine_review_vectors.csv')
full_wine_df.head()

Unnamed: 0,Name,Description,descriptors,review_vector,descriptor_count
0,Nicosia 2013 Vulkà Bianco (Etna),"Aromas include tropical fruit, broom, brimston...","[tropical_fruit, fruit, dry, herb, apple, citr...","[-0.032441616981936545, 0.15638791585721334, -...",9
1,Quinta dos Avidagos 2011 Avidagos Red (Douro),"This is ripe and fruity, a wine that is smooth...","[ripe, fruit, smooth, firm, juicy, berry, frui...","[-0.08550763328767134, 0.09603241912986349, 0....",8
2,Rainstorm 2013 Pinot Gris (Willamette Valley),"Tart and snappy, the flavors of lime flesh and...","[tart, snappy, lime, green, pineapple, crisp, ...","[0.055965694257395136, 0.1996108806031312, -0....",7
3,St. Julian 2013 Reserve Late Harvest Riesling ...,"Pineapple rind, lemon pith and orange blossom ...","[pineapple, rind, lemon_pith, orange_blossom, ...","[0.11896378047799183, 0.12080962220033609, -0....",6
4,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,"Much like the regular bottling from 2012, this...","[rough, tannin, rustic, earth, herb]","[-0.12856429135555994, 0.12319225682620301, 0....",5
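
The cell above averages the IDF-weighted vectors of the descriptors found in each review. A minimal sketch of that computation with plain Python lists follows. The two-dimensional vectors in `toy_weighted` are invented, and `review_embedding` is a hypothetical helper.

```python
def review_embedding(descriptors, weighted_vectors):
    """Return (mean vector, descriptor count) over descriptors with a known vector."""
    found = [weighted_vectors[d] for d in descriptors if d in weighted_vectors]
    if not found:
        return None, 0  # no known descriptors in this review
    dim = len(found[0])
    mean = [sum(vec[i] for vec in found) / len(found) for i in range(dim)]
    return mean, len(found)

# Invented IDF-weighted vectors.
toy_weighted = {"cherry": [1.0, 0.0], "oak": [0.0, 3.0]}
toy_embedding, toy_count = review_embedding(["cherry", "oak", "unknown"], toy_weighted)
print(toy_embedding, toy_count)
```

Descriptors without a vector are skipped, which is why the descriptor count can be lower than the number of terms in the review.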


Almost there! Just some final preparation before we train the model. As a part of this, we will remove any wine reviews with fewer than 5 descriptors.

In [23]:
wine_vectors = pd.read_csv('s3://data-science-wine-reviews/nearest_neighbors/data/wine_review_vectors.csv').set_index('Name')

def convert_to_list(raw_review_vec):
    review_vec_trimmed = raw_review_vec.replace('[', '').replace(']', '')
    review_vec = np.fromstring(review_vec_trimmed, dtype=float, sep='  ')
    review_vec_list = review_vec.tolist()
    return review_vec_list
    
wine_vectors['review_vec'] = wine_vectors['review_vector'].apply(convert_to_list)

def count_dim(review_vec):
    vec_dim = len(review_vec)
    return vec_dim

wine_vectors['vec_dim'] = wine_vectors['review_vec'].apply(count_dim)
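# Apply both filters in sequence; the second must build on the result of the first.
wine_vectors_filtered = wine_vectors.loc[wine_vectors['vec_dim']==300]
wine_vectors_filtered = wine_vectors_filtered.loc[wine_vectors_filtered['descriptor_count']>=5]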
wine_vectors_filtered = wine_vectors_filtered[['Description', 'descriptors', 'review_vector', 'descriptor_count']]

wine_vectors_filtered.to_csv('wine_review_vectors.csv')
boto3.Session().resource('s3').Bucket(bucket).Object('nearest_neighbors/data/wine_review_vectors.csv').upload_file('wine_review_vectors.csv')
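
Because the review vectors were written to CSV, they come back as strings and have to be parsed into numbers again, which is the job of `convert_to_list`. The sketch below shows the same idea with the standard library, assuming the stored string is a bracketed, whitespace-separated list of floats (the way NumPy prints an array). The sample string and the helper `parse_vector_string` are made up for illustration.

```python
def parse_vector_string(raw):
    """Turn a string such as '[0.1 0.2\\n 0.3]' into a list of floats."""
    trimmed = raw.replace("[", "").replace("]", "")
    # split() with no argument handles single spaces, double spaces and newlines.
    return [float(value) for value in trimmed.split()]

toy_raw = "[-0.0324  0.1563\n  0.0559]"
toy_vector = parse_vector_string(toy_raw)
print(toy_vector, len(toy_vector))
```

Checking the length of the parsed list, as the `vec_dim` filter does, guards against reviews whose vector is empty or was not stored in full.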

### 5. Train Nearest Neighbors model on wine embeddings

Now we have everything we need to train our Nearest Neighbors model. Since we will be using a scikit-learn implementation of this model, we have to make use of the 'model serving' functionality in the SageMaker Python SDK. This functionality allows us to write custom functions to ingest the data, make predictions (in our case, return the X nearest neighbors for a given wine embedding) and return the output in the format we desire. These functions sit in the sklearn_nearest_neighbors.py file. Below, we reference this file, configure our training instance and specify the hyperparameters of our Nearest Neighbors model.

In [24]:
from sagemaker.sklearn.estimator import SKLearn

script_path = 'sklearn_nearest_neighbors.py'
sess = sagemaker.Session()

sklearn = SKLearn(
    entry_point=script_path,
    train_instance_type="ml.m5.large",
    role=role,
    sagemaker_session=sess,
    hyperparameters={'n_neighbors': 10, 'metric': 'cosine'})

In [25]:
sklearn.fit({'train': 's3://data-science-wine-reviews/nearest_neighbors/data/wine_review_vectors.csv'})

2019-07-05 16:41:50 Starting - Starting the training job...
2019-07-05 16:41:52 Starting - Launching requested ML instances......
2019-07-05 16:42:58 Starting - Preparing the instances for training...
2019-07-05 16:43:48 Downloading - Downloading input data...
2019-07-05 16:44:19 Training - Training image download completed. Training in progress..
2019-07-05 16:44:19,468 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2019-07-05 16:44:19,470 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-07-05 16:44:19,482 sagemaker_sklearn_container.training INFO     Invoking user training script.
2019-07-05 16:44:19,818 sagemaker-containers INFO     Module sklearn_nearest_neighbors does not provide a setup.py.
Generating setup.py
2019-07-05 16:44:19,818 sagemaker-containers INFO     Generating setup.cfg
2019-07-05 16:44:19,819 sagemaker-containers INFO     Generating MA

After fitting the model, we can deploy it.

In [26]:
predictor = sklearn.deploy(initial_instance_count=1, instance_type="ml.m5.large")

--------------------------------------------------------------------------!

Finally, we can run a quick test to make sure that the model endpoint is returning the desired information.

In [27]:
wine_vectors = pd.read_csv('s3://data-science-wine-reviews/nearest_neighbors/data/wine_review_vectors_sample.csv')

def convert_to_list(raw_review_vec):
    review_vec_trimmed = raw_review_vec.replace('[', '').replace(']', '')
    review_vec = np.fromstring(review_vec_trimmed, dtype=float, sep='  ')
    review_vec_list = review_vec.tolist()
    return review_vec_list
    
wine_vectors['review_vec'] = wine_vectors['review_vector'].apply(convert_to_list)

sample_vector = wine_vectors.at[97, 'review_vec']
sample_vector = np.asarray(sample_vector)

In [28]:
recommendations = predictor.predict(sample_vector)

In [29]:
print(recommendations)

[[1.37459606e-01 1.42040288e-01 1.46988100e-01 1.54312524e-01
  1.56549391e-01 1.62581288e-01 1.62581288e-01 1.62931791e-01
  1.63314825e-01 1.65550581e-01]
 [9.19130000e+04 2.49230000e+04 7.40960000e+04 2.64920000e+04
  7.71960000e+04 9.68710000e+04 1.13695000e+05 8.74650000e+04
  1.00823000e+05 1.44780000e+04]]
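
The output above appears to hold two rows of ten values: the first looks like cosine distances in ascending order, the second like the row indices of the ten nearest wines. The exact format is defined in sklearn_nearest_neighbors.py, which is not shown here, so this reading is an assumption. The brute-force sketch below illustrates what a cosine nearest-neighbors query computes, using invented two-dimensional vectors and hypothetical helper names.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_neighbors(query, vectors, n_neighbors=2):
    """Return (distances, indices) of the closest vectors, nearest first."""
    ranked = sorted((cosine_distance(query, vec), i) for i, vec in enumerate(vectors))
    top = ranked[:n_neighbors]
    return [dist for dist, _ in top], [idx for _, idx in top]

# Invented embeddings; each row stands in for one wine.
toy_wines = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
toy_distances, toy_indices = nearest_neighbors([1.0, 0.1], toy_wines)
print(toy_distances, toy_indices)
```

Under this reading, the returned indices could be mapped back to wine names by position in the training data.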
