# Summarizing Text with Desc



To build our model we will use a two-layered bidirectional RNN with LSTMs on the input data and two layers, each with an LSTM using bahdanau attention on the target data.

The sections of this project are:
- [1.Inspecting the Data](#1.-Insepcting-the-Data)
- [2.Preparing the Data](#2.-Preparing-the-Data)
- [3.Building the Model](#3.-Building-the-Model)
- [4.Training the Model](#4.-Training-the-Model)
- [5.Making Our Own Summaries](#5.-Making-Our-Own-Summaries)

The model is trained with amazon review data.

## Download data
Amazon Reviews Data: [Reviews.csv](https://www.kaggle.com/snap/amazon-fine-food-reviews/downloads/Reviews.csv)


Data set locations for testing:

https://drive.google.com/drive/folders/1QMxZaAMIDFBGCJaWc-453z4NLUDQHE9X?usp=sharing

word embeddings [numberbatch-en-17.06.txt.gz] (https://conceptnet.s3.amazonaws.com/downloads/2017/numberbatch/numberbatch-en-17.06.txt.gz)
after download, extract to **./model/numberbatch-en-17.06.txt**


or glove 
https://nlp.stanford.edu/projects/glove/

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
import re
from nltk.corpus import stopwords
import time
from tensorflow.python.layers.core import Dense
from tensorflow.python.ops.rnn_cell_impl import _zero_state_tensors
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import tensor_array_ops
print('TensorFlow Version: {}'.format(tf.__version__))

TensorFlow Version: 1.12.0


In [2]:
import pickle
def __pickleStuff(filename, stuff):
    save_stuff = open(filename, "wb")
    pickle.dump(stuff, save_stuff)
    save_stuff.close()
def __loadStuff(filename):
    saved_stuff = open(filename,"rb")
    stuff = pickle.load(saved_stuff)
    saved_stuff.close()
    return stuff

# Read data

In [41]:
# A list of contractions from http://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python
contractions = { 
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he's": "he is",
"how'd": "how did",
"how'll": "how will",
"how's": "how is",
"i'd": "i would",
"i'll": "i will",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'll": "it will",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"must've": "must have",
"mustn't": "must not",
"needn't": "need not",
"oughtn't": "ought not",
"shan't": "shall not",
"sha'n't": "shall not",
"she'd": "she would",
"she'll": "she will",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"that'd": "that would",
"that's": "that is",
"there'd": "there had",
"there's": "there is",
"they'd": "they would",
"they'll": "they will",
"they're": "they are",
"they've": "they have",
"wasn't": "was not",
"we'd": "we would",
"we'll": "we will",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"where'd": "where did",
"where's": "where is",
"who'll": "who will",
"who's": "who is",
"won't": "will not",
"wouldn't": "would not",
"you'd": "you would",
"you'll": "you will",
"you 'll": "you will",
"you're": "you are",
 "km" : "kilometers",
    "mi" : "miles"
}

In [42]:
def select_text (data,keys):
    tokens_list = []
    for item in data:
        for i in keys:
            d = item[i] 
            d = d.encode("utf8") 
            tokens = clean_doc(d)    
             
            tokens_list.append(str(tokens))
     
    return tokens_list

In [43]:
import re
import string
# turn a doc into clean tokens

regnumber = re.compile(r'^\d+(?:[,.]\d*)?$')
alpha = r'[a-zA-Z]+'
number = r'[-+]?[0-9]*(\.|:)?[0-9]+'
def clean_doc(doc):
    # split into tokens by white space
    ##tokens = doc.split()
    # remove punctuation from each token
    #table = string.maketrans('', '', punctuaion)
    #tokens = [w.translate(table) for w in tokens]
    
    # We are not using "text.split()" here
    #since it is not fool proof, e.g. words followed by punctuations "Are you kidding?I think you aren't."
    #text = re.findall(r"[a-zA-Z]+", doc) #[-+]?[0-9]*\.?[0-9]+
    text = re.findall(r"([a-zA-Z\'']+|[-+]?[0-9]*\.?[0-9]+)", doc)
    
    #text = re.findall(r"[a-zA-Z\'']+",doc)
    new_text = []
    for word in text:
        if word in contractions:
            new_text.append(contractions[word])
             
        else:

            new_text.append(word)
    #new_text = [w for w in new_text if w in vocab]
    text = " ".join(new_text)
    text = re.sub(r'[_"\-;%()|+&=*%!:#$@\[\]/]', ' ', text) #skip ".,?"
    text = re.sub(r'\'', ' ', text)
    return text
                      
    #for w in tokens:         
    #    w =  re.sub('['+string.punctuation+']', '', w )       
    # filter out tokens not in vocab
    
    #tokens = [w for w in tokens if w in vocab]
    #tokens = ' '.join(tokens)
     
    #return tokens

In [44]:
import json
from os import listdir
# load all docs in a directory
def process_docs(directory):
	documents = []
	# walk through all files in the folder
	for filename in listdir(directory):
	    # create the full path of the file to open
	    path = directory + '/' + filename
	    # load the doc
	    #doc = load_doc(path)
	    with open(path, "r") as f:
	 	     doc = json.load(f)
	    #tokens_list = select_text (doc ,['description','facility','nearby'])
	    # clean doc
	    tokens_list = select_text (doc ,['description']) 
	    #tokens = clean_doc(doc[, vocab)
	    # add to list
	    documents.extend(tokens_list)
        #print(tokens_list[:5])
	return documents

In [45]:
# load all training text
hotel_docs = process_docs('./datajson')
hotel_docs[:5]

['With a stay at Petpimarn Boutique Resort in Bangkok Chatuchak you will be within a 15 minute drive of Kasetsart University and IMPACT Arena This hotel is 9.7 miles 15.6 kilometers from Temple of the Emerald Buddha and 10 miles 16.2 kilometers from Wat Arun Make yourself at home in one of the 89 air conditioned rooms featuring refrigerators Complimentary wireless Internet access keeps you connected and digital programming is available for your entertainment Bathrooms have showers and complimentary toiletries Conveniences include desks and complimentary bottled water and housekeeping is provided daily Make use of convenient amenities which include complimentary wireless Internet access and tour ticket assistance At Petpimarn Boutique Resort enjoy a satisfying meal at the restaurant English breakfasts are available daily from 6 30 AM to 10 AM for a fee Featured amenities include dry cleaning laundry services a 24 hour front desk and luggage storage Free self parking is available onsite'

## Load those prepared data and skip to section "[3. Building the Model](#3.-Building-the-Model)"
Once we have run through the "[2.Preparing the Data](#2.-Preparing-the-Data)" section, we should have those data, uncomment and run those lines.

In [3]:
clean_summaries = __loadStuff("./data/clean_summaries.p")
clean_texts = __loadStuff("./data/clean_texts.p")

sorted_summaries = __loadStuff("./data/sorted_summaries.p")
sorted_texts = __loadStuff("./data/sorted_texts.p")
word_embedding_matrix = __loadStuff("./data/word_embedding_matrix.p")

vocab_to_int = __loadStuff("./data/vocab_to_int.p")
int_to_vocab = __loadStuff("./data/int_to_vocab.p")


## 1. Insepcting the Data

In [80]:
reviews = pd.read_csv("Reviews.csv")

In [81]:
reviews.shape

(568454, 10)

In [82]:
reviews.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [83]:
# Check for any nulls values
reviews.isnull().sum()

Id                         0
ProductId                  0
UserId                     0
ProfileName               16
HelpfulnessNumerator       0
HelpfulnessDenominator     0
Score                      0
Time                       0
Summary                   27
Text                       0
dtype: int64

In [84]:
# Remove null values and unneeded features
reviews = reviews.dropna()
reviews = reviews.drop(['Id','ProductId','UserId','ProfileName','HelpfulnessNumerator','HelpfulnessDenominator',
                        'Score','Time'], 1)
reviews = reviews.reset_index(drop=True)

In [85]:
reviews.shape

(568411, 2)

In [86]:
reviews.head()

Unnamed: 0,Summary,Text
0,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,"""Delight"" says it all",This is a confection that has been around a fe...
3,Cough Medicine,If you are looking for the secret ingredient i...
4,Great taffy,Great taffy at a great price. There was a wid...


In [87]:
# Inspecting some of the reviews
for i in range(5):
    print("Review #",i+1)
    print(reviews.Summary[i])
    print(reviews.Text[i])
    print()

('Review #', 1)
Good Quality Dog Food
I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.
()
('Review #', 2)
Not as Advertised
Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".
()
('Review #', 3)
"Delight" says it all
This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis' "The Lion, The Witch, and The Wardrobe" - this is the 

## 2. Preparing the Data

In [89]:
# A list of contractions from http://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python
contractions = { 
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he's": "he is",
"how'd": "how did",
"how'll": "how will",
"how's": "how is",
"i'd": "i would",
"i'll": "i will",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'll": "it will",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"must've": "must have",
"mustn't": "must not",
"needn't": "need not",
"oughtn't": "ought not",
"shan't": "shall not",
"sha'n't": "shall not",
"she'd": "she would",
"she'll": "she will",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"that'd": "that would",
"that's": "that is",
"there'd": "there had",
"there's": "there is",
"they'd": "they would",
"they'll": "they will",
"they're": "they are",
"they've": "they have",
"wasn't": "was not",
"we'd": "we would",
"we'll": "we will",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"where'd": "where did",
"where's": "where is",
"who'll": "who will",
"who's": "who is",
"won't": "will not",
"wouldn't": "would not",
"you'd": "you would",
"you'll": "you will",
"you're": "you are",
 "km"   : 'kilometers',
    "mi": 'miles'
}

In [47]:
regnumber = re.compile(r'^\d+(?:[,.]\d*)?$')

def clean_text(text, remove_stopwords = True):
    '''Remove unwanted characters, stopwords, and format the text to create fewer nulls word embeddings'''
    
    # Convert words to lower case
    text = text.lower()
    
    # Replace contractions with their longer forms 
    if True:
        # We are not using "text.split()" here
        #since it is not fool proof, e.g. words followed by punctuations "Are you kidding?I think you aren't."
        text = re.findall(r"[\w']+", text)
        new_text = []
        for word in text:
            if word in contractions:
                new_text.append(contractions[word])
            else:
                new_text.append(word)
        text = " ".join(new_text)
    
    # Format words and remove unwanted characters
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)# remove links
    text = re.sub(r'\<a href', ' ', text)# remove html link tag
    text = re.sub(r'&amp;', '', text) 
    text = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', text)
    text = re.sub(r'<br />', ' ', text)
    text = re.sub(r'\'', ' ', text)
    
    # Optionally, remove stop words
    if remove_stopwords:
        text = text.split()
        stops = set(stopwords.words("english"))
        text = [w for w in text if not w in stops]
        text = " ".join(text)

    return text

In [91]:
clean_text("That's a great movie,Can you believe it?I've.But you may not.")

'great movie believe may'

### Clean the summaries and texts
We will remove the stopwords from the texts because they do not provide much use for training our model. However, we will keep them for our summaries so that they sound more like natural phrases. 

In [92]:
clean_summaries = []
for summary in reviews.Summary:
    clean_summaries.append(clean_text(summary, remove_stopwords=False))
print("Summaries are complete.")

clean_texts = []
for text in reviews.Text:
    clean_texts.append(clean_text(text))
print("Texts are complete.")

Summaries are complete.
Texts are complete.


In [93]:
# Inspect the cleaned summaries and texts to ensure they have been cleaned well
for i in range(5):
    print("Clean Review #",i+1)
    print(clean_summaries[i])
    print(clean_texts[i])
    print()

('Clean Review #', 1)
good quality dog food
bought several vitality canned dog food products found good quality product looks like stew processed meat smells better labrador finicky appreciates product better
()
('Clean Review #', 2)
not as advertised
product arrived labeled jumbo salted peanuts peanuts actually small sized unsalted sure error vendor intended represent product jumbo
()
('Clean Review #', 3)
delight says it all
confection around centuries light pillowy citrus gelatin nuts case filberts cut tiny squares liberally coated powdered sugar tiny mouthful heaven chewy flavorful highly recommend yummy treat familiar story c lewis lion witch wardrobe treat seduces edmund selling brother sisters witch
()
('Clean Review #', 4)
cough medicine
looking secret ingredient robitussin believe found got addition root beer extract ordered good made cherry soda flavor medicinal
()
('Clean Review #', 5)
great taffy
great taffy great price wide assortment yummy taffy delivery quick taffy lover

### Count the number of occurrences of each word in a set of text

In [4]:
def count_words(count_dict, text):
    for sentence in text:
        for word in sentence.split():
            if word not in count_dict:
                count_dict[word] = 1
            else:
                count_dict[word] += 1

#### Give the function a try

In [47]:
mydict = {}
count_words(mydict, ["that is a great great great dog","you have a great dog"])
mydict

{'a': 2, 'dog': 2, 'great': 4, 'have': 1, 'is': 1, 'that': 1, 'you': 1}

In [5]:
word_counts = {}
count_words(word_counts, clean_summaries)
count_words(word_counts, clean_texts)
print("Size of Vocabulary:", len(word_counts))

('Size of Vocabulary:', 125808)


Let's see how may "hero" occurs in the data

In [49]:
word_counts["hero"]

114

### Load Conceptnet Numberbatch's (CN) embeddings, similar to GloVe, but probably better 
 (https://github.com/commonsense/conceptnet-numberbatch)

In [6]:
import gensim.models.word2vec as w2v

In [7]:

filename50 ="../trained/hotel2vec_desc-50.w2v"
filename300 ="../trained/hotel2vec_desc-300.w2v"
filename_dnum ="../trained/hotel2vec_desc-number.w2v"
filename_d_gg = '../trained/hotel2vec-gg-desc-300.w2v'

filename_glove_300 = '../glove.6B.300d.txt'
filename_glove_50 = '../glove.6B.50d.txt'
filename_num = '../numberbatch-en-17.02.txt'




In [None]:
from gensim.models import KeyedVectors

#load model gg
model_gg =  KeyedVectors.load_word2vec_format('../GoogleNews-vectors-negative300.bin',binary=True)
vocab_gg = model_gg.vocab.keys()
wordsInVocab = len(vocab_gg)
 
embeddings_index = {}
for v in vocab_gg:
     
    word = v 
    embedding = np.asarray(model_gg.wv[v], dtype='float32')
    embeddings_index[word] = embedding

print('Word embeddings:', len(embeddings_index))



In [None]:
#load extra des on top of model_gg

filename = filename_d_gg

model_gg_d =  w2v.Word2Vec.load(filename)
word_vectors = model_gg_d.wv

vocab_gg = word_vectors.vocab
 
for v in vocab_gg:
     
    word = v.decode('utf-8')
    embedding = np.asarray(model_gg_d.wv[v], dtype='float32')
    embeddings_index[word] = embedding


print('Word embeddings:', len(embeddings_index))


In [70]:
#load based model
filename = filename_glove_50

embeddings_index = {}
with open(filename) as f:
#with open('./model/numberbatch-en-17.06.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split(' ')
        word = values[0].decode('utf-8')
        embedding = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = embedding

print('Word embeddings:', len(embeddings_index))

#load desc model
filename = filename50



model_d =  w2v.Word2Vec.load(filename)
word_vectors = model_d.wv

vocab_gg = word_vectors.vocab
 
for v in vocab_gg:
     
    word = v 
    embedding = np.asarray(model_d.wv[v], dtype='float32')
    embeddings_index[word] = embedding
    
print('Word embeddings:', len(embeddings_index))

('Word embeddings:', 400000)
('Word embeddings:', 400918)


In [9]:
 #load based model
filename = filename_glove_300

embeddings_index = {}
with open(filename) as f:
#with open('./model/numberbatch-en-17.06.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split(' ')
        word = values[0].decode('utf-8')
        embedding = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = embedding

print('Word embeddings:', len(embeddings_index))

#load desc model
filename = filename300

model_d =  w2v.Word2Vec.load(filename)
word_vectors = model_d.wv

vocab_gg = word_vectors.vocab
 
for v in vocab_gg:
     
    word = v 
    embedding = np.asarray(model_d.wv[v], dtype='float32')
    embeddings_index[word] = embedding

print('Word embeddings:', len(embeddings_index))

KeyboardInterrupt: 

In [40]:
#load based model
filename = filename_num

embeddings_index = {}
with open(filename) as f:
#with open('./model/numberbatch-en-17.06.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split(' ')
        word = values[0].decode('utf-8')
        embedding = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = embedding

print('Word embeddings:', len(embeddings_index))

#load desc model
filename = filename_dnum

model_d =  w2v.Word2Vec.load(filename)
word_vectors = model_d.wv

vocab_gg = word_vectors.vocab
 
for v in vocab_gg:
     
    word = v.decode('utf-8')
    embedding = np.asarray(model_d.wv[v], dtype='float32')
    embeddings_index[word] = embedding
 

print('Word embeddings:', len(embeddings_index))

('Word embeddings:', 484557)


### Take a look at the CN embedding dimension

In [44]:
embeddings_index["hero"].shape

(50,)

### Find the number of words that are missing from CN, and are used more than our threshold.

I use a **threshold** of 20, so that words not in CN can be added to our **word_embedding_matrix**, but they need to be common enough in the reviews so that the model can understand their meaning.

In [11]:
missing_words = 0
threshold = 20

for word, count in word_counts.items():
    if count > threshold:
        if word not in embeddings_index:
            missing_words += 1
            
missing_ratio = round(missing_words/len(word_counts),4)*100
            
print("Number of words missing from CN:", missing_words)
print("Percent of words that are missing from vocabulary: {}%".format(missing_ratio))

('Number of words missing from CN:', 1972)
Percent of words that are missing from vocabulary: 0.0%


### What are those missing words in the CN
Looks mostly products' brand.

In [12]:
missing_words = []
for word, count in word_counts.items():
    if count > threshold and word not in embeddings_index:
        missing_words.append((word,count))
missing_words[:30]

[('pizzle', 54),
 ('27g', 38),
 ('shelties', 129),
 ('cuddlecuss', 21),
 ('sandies', 50),
 ('golean', 276),
 ('33oz', 35),
 ('detangling', 32),
 ('calbee', 45),
 ('caribu', 29),
 ('eiermann', 31),
 ('bluberry', 31),
 ('completly', 88),
 ('wheatena', 77),
 ('flatulance', 29),
 ('perfumey', 86),
 ('caramello', 30),
 ('peanutty', 179),
 ('eacute', 945),
 ('wayy', 24),
 ('unpleasent', 27),
 ('sprouter', 121),
 ('shakeology', 27),
 ('xylosweet', 56),
 ('foodshouldtastegood', 99),
 ('teavana', 532),
 ('cyto', 30),
 ('proteinate', 233),
 ('tobacman', 39),
 ('recomended', 191)]

### Words to indexes, indexes to words dicts
Limit the vocab that we will use to words that appear ≥ threshold or are in CN

In [13]:
#dictionary to convert words to integers
vocab_to_int = {} 
# Index words from 0
value = 0
for word, count in word_counts.items():
    if count >= threshold or word in embeddings_index:
        vocab_to_int[word] = value
        value += 1

# Special tokens that will be added to our vocab
codes = ["<UNK>","<PAD>","<EOS>","<GO>"]   

# Add codes to vocab
for code in codes:
    vocab_to_int[code] = len(vocab_to_int)

# Dictionary to convert integers to words
int_to_vocab = {}
for word, value in vocab_to_int.items():
    int_to_vocab[value] = word

usage_ratio = round(len(vocab_to_int) / len(word_counts),4)*100

print("Total number of unique words:", len(word_counts))
print("Number of words we will use:", len(vocab_to_int))
print("Percent of words we will use: {}%".format(usage_ratio))

('Total number of unique words:', 125808)
('Number of words we will use:', 63898)
Percent of words we will use: 0.0%


### Create word embedding matrix
It has shape (nb_words, embedding_dim) i.e. (59072, 300) in this case. 1st dim is word index, 2nd dim is from CN or random generated.

In [14]:
# Need to use 300 for embedding dimensions to match CN's vectors.
embedding_dim =  50 # 50 for ./glove.6B.50d.txt
nb_words = len(vocab_to_int)

# Create matrix with default values of zero
word_embedding_matrix = np.zeros((nb_words, embedding_dim), dtype=np.float32)
for word, i in vocab_to_int.items():
    if word in embeddings_index:
        word_embedding_matrix[i] = embeddings_index[word]
    else:
        # If word not in CN, create a random embedding for it
        new_embedding = np.array(np.random.uniform(-1.0, 1.0, embedding_dim))
        embeddings_index[word] = new_embedding
        word_embedding_matrix[i] = new_embedding

# Check if value matches len(vocab_to_int)
print(len(word_embedding_matrix))

63898


### Function to convert sentences to sequence of words indexes
It also use `<UNK>` index to replace unknown words, append `<EOS>` (End of Sentence) to the sequences if eos is set True

In [15]:
def convert_to_ints(text, word_count, unk_count, eos=False):
    '''Convert words in text to an integer.
       If word is not in vocab_to_int, use UNK's integer.
       Total the number of words and UNKs.
       Add EOS token to the end of texts'''
    ints = []
    for sentence in text:
        sentence_ints = []
        for word in sentence.split():
            word_count += 1
            if word in vocab_to_int:
                sentence_ints.append(vocab_to_int[word])
            else:
                sentence_ints.append(vocab_to_int["<UNK>"])
                unk_count += 1
        if eos:
            sentence_ints.append(vocab_to_int["<EOS>"])
        ints.append(sentence_ints)
    return ints, word_count, unk_count

Apply convert_to_ints to clean_summaries and clean_texts

In [16]:

word_count = 0
unk_count = 0

int_summaries, word_count, unk_count = convert_to_ints(clean_summaries, word_count, unk_count)
int_texts, word_count, unk_count = convert_to_ints(clean_texts, word_count, unk_count, eos=True)

unk_percent = round(unk_count/word_count,4)*100

print("Total number of words in headlines:", word_count)
print("Total number of UNKs in headlines:", unk_count)
print("Percent of words that are UNK: {}%".format(unk_percent))

('Total number of words in headlines:', 26232257)
('Total number of UNKs in headlines:', 147342)
Percent of words that are UNK: 0.0%


### Take a look at what the sequence looks like
Each number here represents a word

In [57]:
int_summaries[:3]

[[60993, 59530, 52011, 2878],
 [34310, 8935, 35494],
 [52121, 54938, 12305, 22910]]

### Function to get the length of each sequence

In [17]:
def create_lengths(text):
    '''Create a data frame of the sentence lengths from a text'''
    lengths = []
    for sentence in text:
        lengths.append(len(sentence))
    return pd.DataFrame(lengths, columns=['counts'])

In [18]:
create_lengths(int_summaries[:3])

Unnamed: 0,counts
0,4
1,3
2,4


Get statistic summary of the length of summaries and texts

In [19]:
lengths_summaries = create_lengths(int_summaries)
lengths_texts = create_lengths(int_texts)

print("Summaries:")
print(lengths_summaries.describe())
print()
print("Texts:")
print(lengths_texts.describe())

Summaries:
              counts
count  568411.000000
mean        4.181230
std         2.657248
min         0.000000
25%         2.000000
50%         4.000000
75%         5.000000
max        48.000000
()
Texts:
              counts
count  568411.000000
mean       42.968927
std        44.164343
min         2.000000
25%        18.000000
50%        30.000000
75%        51.000000
max      2063.000000


### See what's the max squence length we can cover by percentile

In [20]:
# Inspect the length of texts
print(np.percentile(lengths_texts.counts, 89.5))
print(np.percentile(lengths_texts.counts, 95))
print(np.percentile(lengths_texts.counts, 99))

84.0
118.0
216.0


In [21]:
# Inspect the length of summaries
print(np.percentile(lengths_summaries.counts, 90))
print(np.percentile(lengths_summaries.counts, 95))
print(np.percentile(lengths_summaries.counts, 99))

8.0
9.0
13.0


## Function to counts the number of time `<UNK>` appears in a sentence

In [22]:
def unk_counter(sentence):
    '''Counts the number of time UNK appears in a sentence.'''
    unk_count = 0
    for word in sentence:
        if word == vocab_to_int["<UNK>"]:
            unk_count += 1
    return unk_count

**Filter** for length limit and number of `<UNK>`s

**Sort** the summaries and texts by the length of the element in **texts** from shortest to longest


In [23]:
max_text_length = 83 # This will cover up to 89.5% lengthes
max_summary_length = 13 # This will cover up to 99% lengthes
min_length = 2
unk_text_limit = 1 # text can contain up to 1 UNK word
unk_summary_limit = 0 # Summary should not contain any UNK word

def filter_condition(item):
    int_summary = item[0]
    int_text = item[1]
    if(len(int_summary) >= min_length and 
       len(int_summary) <= max_summary_length and 
       len(int_text) >= min_length and 
       len(int_text) <= max_text_length and 
       unk_counter(int_summary) <= unk_summary_limit and 
       unk_counter(int_text) <= unk_text_limit):
        return True
    else:
        return False

int_text_summaries = list(zip(int_summaries , int_texts))
int_text_summaries_filtered = list(filter(filter_condition, int_text_summaries))
sorted_int_text_summaries = sorted(int_text_summaries_filtered, key=lambda item: len(item[1]))
sorted_int_text_summaries = list(zip(*sorted_int_text_summaries))
sorted_summaries = list(sorted_int_text_summaries[0])
sorted_texts = list(sorted_int_text_summaries[1])
# Delete those temporary varaibles
del int_text_summaries, sorted_int_text_summaries, int_text_summaries_filtered
# Compare lengths to ensure they match
print(len(sorted_summaries))
print(len(sorted_texts))

430497
430497


### Inspect the length of text in sorted_texts

In [65]:
lengths_texts = [len(text) for text in sorted_texts]
lengths_texts[:20]

[2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]

## Save data for later

In [66]:
__pickleStuff("./data/clean_summaries.p",clean_summaries)
__pickleStuff("./data/clean_texts.p",clean_texts)

__pickleStuff("./data/sorted_summaries.p",sorted_summaries)
__pickleStuff("./data/sorted_texts.p",sorted_texts)
__pickleStuff("./data/word_embedding_matrix.p",word_embedding_matrix)

__pickleStuff("./data/vocab_to_int.p",vocab_to_int)
__pickleStuff("./data/int_to_vocab.p",int_to_vocab)

## 3. Building the Model

Create palceholders for inputs to the model

**summary_length** and **text_length** are the sentence lengths in a batch, and **max_summary_length** is the maximum length of a summary in a batch.

In [24]:
def model_inputs():
    input_data = tf.placeholder(tf.int32, [None, None], name='input')
    targets = tf.placeholder(tf.int32, [None, None], name='targets')
    lr = tf.placeholder(tf.float32, name='learning_rate')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
    summary_length = tf.placeholder(tf.int32, (None,), name='summary_length')
    max_summary_length = tf.reduce_max(summary_length, name='max_dec_len')
    text_length = tf.placeholder(tf.int32, (None,), name='text_length')

    return input_data, targets, lr, keep_prob, summary_length, max_summary_length, text_length

Remove the last word id from each batch and concatenate the id of `<GO>` to the begining of each batch

In [25]:
def process_encoding_input(target_data, vocab_to_int, batch_size):  
    ending = tf.strided_slice(target_data, [0, 0], [batch_size, -1], [1, 1]) # slice it to target_data[0:batch_size, 0: -1]
    dec_input = tf.concat([tf.fill([batch_size, 1], vocab_to_int['<GO>']), ending], 1)

    return dec_input

### Create the encoding layers

bidirectional_dynamic_rnn
use **tf.variable_scope** so that variables are reused with each layer

parameters
- **rnn_size**: The number of units in the LSTM cell
- **sequence_length**: size [batch_size], containing the actual lengths for each of the sequences in the batch
- **num_layers**: number of bidirectional RNN layer
- **rnn_inputs**: number of bidirectional RNN layer
- **keep_prob**: RNN dropout input keep probability

In [26]:
def encoding_layer(rnn_size, sequence_length, num_layers, rnn_inputs, keep_prob):
    for layer in range(num_layers):
        with tf.variable_scope('encoder_{}'.format(layer)):
            cell_fw = tf.contrib.rnn.LSTMCell(rnn_size,
                                              initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            cell_fw = tf.contrib.rnn.DropoutWrapper(cell_fw, 
                                                    input_keep_prob = keep_prob)

            cell_bw = tf.contrib.rnn.LSTMCell(rnn_size,
                                              initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            cell_bw = tf.contrib.rnn.DropoutWrapper(cell_bw, 
                                                    input_keep_prob = keep_prob)

            enc_output, enc_state = tf.nn.bidirectional_dynamic_rnn(cell_fw, 
                                                                    cell_bw, 
                                                                    rnn_inputs,
                                                                    sequence_length,
                                                                    dtype=tf.float32)
            enc_output = tf.concat(enc_output,2)
            # original code is missing this line below, that is how we connect layers 
            # by feeding the current layer's output to next layer's input
            rnn_inputs = enc_output
    return enc_output, enc_state

### Create the training decoding layer
parameters
- **dec_embed_input**: output of embedding_lookup for a batch of inputs
- **summary_length**: length of each padded summary sequences in batch, since padded, all lengths should be same number 
- **dec_cell**: the decoder RNN cells' output with attention wapper
- **output_layer**: fully connected layer to apply to the RNN output
- **vocab_size**: vocabulary size i.e. len(vocab_to_int)+1
- **max_summary_length**: the maximum length of a summary in a batch
- **batch_size**: number of input sequences in a batch

Three components

- **TraingHelper** reads a sequence of integers from the encoding layer.
- **BasicDecoder** processes the sequence with the decoding cell, and an output layer, which is a fully connected layer. **initial_state** set to zero state.
- **dynamic_decode** creates our outputs that will be used for training.

In [27]:
def training_decoding_layer(dec_embed_input, summary_length, dec_cell, output_layer,
                            vocab_size, max_summary_length,batch_size):
    training_helper = tf.contrib.seq2seq.TrainingHelper(inputs=dec_embed_input,
                                                        sequence_length=summary_length,
                                                        time_major=False)

    training_decoder = tf.contrib.seq2seq.BasicDecoder(cell=dec_cell,
                                                       helper=training_helper,
                                                       initial_state=dec_cell.zero_state(dtype=tf.float32, batch_size=batch_size),
                                                       output_layer = output_layer)

    training_logits = tf.contrib.seq2seq.dynamic_decode(training_decoder,
                                                           output_time_major=False,
                                                           impute_finished=True,
                                                           maximum_iterations=max_summary_length)
    return training_logits

### Create infer decoding layer

parameters
- **embeddings**: the CN's word_embedding_matrix
- **start_token**: the id of `<GO>`
- **end_token**: the id of `<EOS>`
- **dec_cell**: the decoder RNN cells' output with attention wapper
- **output_layer**: fully connected layer to apply to the RNN output
- **max_summary_length**: the maximum length of a summary in a batch
- **batch_size**: number of input sequences in a batch

**GreedyEmbeddingHelper** argument **start_tokens**: int32 vector shaped [batch_size], the start tokens.

In [28]:
def inference_decoding_layer(embeddings, start_token, end_token, dec_cell, output_layer,
                             max_summary_length, batch_size):
    '''Create the inference logits'''
    
    start_tokens = tf.tile(tf.constant([start_token], dtype=tf.int32), [batch_size], name='start_tokens')
    
    inference_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(embeddings,
                                                                start_tokens,
                                                                end_token)
                
    inference_decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell,
                                                        inference_helper,
                                                        dec_cell.zero_state(dtype=tf.float32, batch_size=batch_size),
                                                        output_layer)
                
    inference_logits = tf.contrib.seq2seq.dynamic_decode(inference_decoder,
                                                            output_time_major=False,
                                                            impute_finished=True,
                                                            maximum_iterations=max_summary_length)
    
    return inference_logits

### Create Decoding layer
3 parts: decoding cell, attention, and getting our logits.
#### Decoding Cell: 
Just a two layer LSTM with dropout.
#### Attention: 
Using Bhadanau, since trains faster than Luong. 

**AttentionWrapper** applies the attention mechanism to our decoding cell.

parameters
- **dec_embed_input**: output of embedding_lookup for a batch of inputs
- **embeddings**: the CN's word_embedding_matrix
- **enc_output**: encoder layer output, containing the forward and the backward rnn output
- **enc_state**: encoder layer state, a tuple containing the forward and the backward final states of bidirectional rnn.
- **vocab_size**: vocabulary size i.e. len(vocab_to_int)+1
- **text_length**: the actual lengths for each of the input text sequences in the batch
- **summary_length**: the actual lengths for each of the input summary sequences in the batch
- **max_summary_length**: the maximum length of a summary in a batch
- **rnn_size**: The number of units in the LSTM cell
- **vocab_to_int**: vocab_to_int the dictionary
- **keep_prob**: RNN dropout input keep probability
- **batch_size**: number of input sequences in a batch
- **num_layers**: number of decoder RNN layer

In [29]:
def lstm_cell(lstm_size, keep_prob):
    cell = tf.contrib.rnn.BasicLSTMCell(lstm_size)
    return tf.contrib.rnn.DropoutWrapper(cell, input_keep_prob = keep_prob)

def decoding_layer(dec_embed_input, embeddings, enc_output, enc_state, vocab_size, text_length, summary_length,
                   max_summary_length, rnn_size, vocab_to_int, keep_prob, batch_size, num_layers):
    '''Create the decoding cell and attention for the training and inference decoding layers'''
    dec_cell = tf.contrib.rnn.MultiRNNCell([lstm_cell(rnn_size, keep_prob) for _ in range(num_layers)])
    output_layer = Dense(vocab_size,kernel_initializer=tf.truncated_normal_initializer(mean=0.0, stddev=0.1))
    attn_mech = tf.contrib.seq2seq.BahdanauAttention(rnn_size,
                                                     enc_output,
                                                     text_length,
                                                     normalize=False,
                                                     name='BahdanauAttention')
    dec_cell = tf.contrib.seq2seq.AttentionWrapper(dec_cell,attn_mech,rnn_size)
    with tf.variable_scope("decode"):
        training_logits = training_decoding_layer(dec_embed_input,summary_length,dec_cell,
                                                  output_layer,
                                                  vocab_size,
                                                  max_summary_length,
                                                  batch_size)
    with tf.variable_scope("decode", reuse=True):
        inference_logits = inference_decoding_layer(embeddings,
                                                    vocab_to_int['<GO>'],
                                                    vocab_to_int['<EOS>'],
                                                    dec_cell,
                                                    output_layer,
                                                    max_summary_length,
                                                    batch_size)
    return training_logits, inference_logits

In [30]:
def seq2seq_model(input_data, target_data, keep_prob, text_length, summary_length, max_summary_length, 
                  vocab_size, rnn_size, num_layers, vocab_to_int, batch_size):
    '''Use the previous functions to create the training and inference logits'''
    
    # Use Numberbatch's embeddings and the newly created ones as our embeddings
    embeddings = word_embedding_matrix
    enc_embed_input = tf.nn.embedding_lookup(embeddings, input_data)
    enc_output, enc_state = encoding_layer(rnn_size, text_length, num_layers, enc_embed_input, keep_prob)
    dec_input = process_encoding_input(target_data, vocab_to_int, batch_size) #shape=(batch_size, senquence length) each seq start with index of<GO>
    dec_embed_input = tf.nn.embedding_lookup(embeddings, dec_input)
    training_logits, inference_logits  = decoding_layer(dec_embed_input, 
                                                        embeddings,
                                                        enc_output,
                                                        enc_state, 
                                                        vocab_size, 
                                                        text_length, 
                                                        summary_length, 
                                                        max_summary_length,
                                                        rnn_size, 
                                                        vocab_to_int, 
                                                        keep_prob, 
                                                        batch_size,
                                                        num_layers)
    return training_logits, inference_logits

### Pad sentences for batch
Pad so the actual lengths for each of the sequences in the batch have the same length.

In [31]:
def pad_sentence_batch(sentence_batch):
    """Pad sentences with <PAD> so that each sentence of a batch has the same length"""
    max_sentence = max([len(sentence) for sentence in sentence_batch])
    return [sentence + [vocab_to_int['<PAD>']] * (max_sentence - len(sentence)) for sentence in sentence_batch]

### Function to generate batch data for training

In [32]:
def get_batches(summaries, texts, batch_size):
    """Batch summaries, texts, and the lengths of their sentences together"""
    for batch_i in range(0, len(texts)//batch_size):
        start_i = batch_i * batch_size
        summaries_batch = summaries[start_i:start_i + batch_size]
        texts_batch = texts[start_i:start_i + batch_size]
        pad_summaries_batch = np.array(pad_sentence_batch(summaries_batch))
        pad_texts_batch = np.array(pad_sentence_batch(texts_batch))
        
        # Need the lengths for the _lengths parameters
        pad_summaries_lengths = []
        for summary in pad_summaries_batch:
            pad_summaries_lengths.append(len(summary))
        
        pad_texts_lengths = []
        for text in pad_texts_batch:
            pad_texts_lengths.append(len(text))
        
        yield pad_summaries_batch, pad_texts_batch, pad_summaries_lengths, pad_texts_lengths

#### Just to test "get_batches" function
Here we generate a batch with size of 5

Checkout those "59069" they are `<PAD>`s, also all sequences' lengths are the same.

In [33]:
print("'<PAD>' has id: {}".format(vocab_to_int['<PAD>']))
sorted_summaries_samples = sorted_summaries[7:50]
sorted_texts_samples = sorted_texts[7:50]
pad_summaries_batch_samples, pad_texts_batch_samples, pad_summaries_lengths_samples, pad_texts_lengths_samples = next(get_batches(
    sorted_summaries_samples, sorted_texts_samples, 5))
print("pad summaries batch samples:\n\r {}".format(pad_summaries_batch_samples))

'<PAD>' has id: 63895
pad summaries batch samples:
 [[52734 17911 10689 46839 46839 28625 21049 41655 19282 35002 14931 38129]
 [26884 60014 51125 12313  4864 63895 63895 63895 63895 63895 63895 63895]
 [34310  8935 38628  8935  7728 61835 24200 44506  1787 44342 14923 41818]
 [25421 12302 32350 14931 42071  8213 62980 39334 38602 63895 63895 63895]
 [43581 45712 38602 12253  1124 63895 63895 63895 63895 63895 63895 63895]]


In [35]:
# Set the Hyperparameters
epochs = 100
batch_size = 32
rnn_size = 256
#rnn_size = 64
num_layers = 2
learning_rate = 0.005
keep_probability = 0.95

## Build graph

In [36]:
# Build the graph
train_graph = tf.Graph()
# Set the graph to default to ensure that it is ready for training
with train_graph.as_default():
    
    # Load the model inputs    
    input_data, targets, lr, keep_prob, summary_length, max_summary_length, text_length = model_inputs()

    # Create the training and inference logits
    training_logits, inference_logits = seq2seq_model(tf.reverse(input_data, [-1]),
                                                      targets, 
                                                      keep_prob,   
                                                      text_length,
                                                      summary_length,
                                                      max_summary_length,
                                                      len(vocab_to_int)+1,
                                                      rnn_size, 
                                                      num_layers, 
                                                      vocab_to_int,
                                                      batch_size)
    
    # Create tensors for the training logits and inference logits
    training_logits = tf.identity(training_logits[0].rnn_output, 'logits')
    inference_logits = tf.identity(inference_logits[0].sample_id, name='predictions')
    
    # Create the weights for sequence_loss, the sould be all True across since each batch is padded
    masks = tf.sequence_mask(summary_length, max_summary_length, dtype=tf.float32, name='masks')

    with tf.name_scope("optimization"):
        # Loss function
        cost = tf.contrib.seq2seq.sequence_loss(
            training_logits,
            targets,
            masks)

        # Optimizer
        optimizer = tf.train.AdamOptimizer(learning_rate)

        # Gradient Clipping
        gradients = optimizer.compute_gradients(cost)
        capped_gradients = [(tf.clip_by_value(grad, -5., 5.), var) for grad, var in gradients if grad is not None]
        train_op = optimizer.apply_gradients(capped_gradients)
print("Graph is built.")
graph_location = "./graph"
print(graph_location)
train_writer = tf.summary.FileWriter(graph_location)
train_writer.add_graph(train_graph)

Instructions for updating:
This class is deprecated, please use tf.nn.rnn_cell.LSTMCell, which supports all the feature this cell currently has. Please replace the existing code with tf.nn.rnn_cell.LSTMCell(name='basic_lstm_cell').
Graph is built.
./graph


## 4. Training the Model

Only going to use a subset of the data to reduce the traing time for this demo.

We chose not use use the start of the subset because because those are shorter sequences and we don't want to make it too easy for the model.

In [37]:
# Subset the data for training
start = 200000
end = start + 50000
sorted_summaries_short = sorted_summaries[start:end]
sorted_texts_short = sorted_texts[start:end]
print("The shortest text length:", len(sorted_texts_short[0]))
print("The longest text length:",len(sorted_texts_short[-1]))

('The shortest text length:', 25)
('The longest text length:', 31)


In [38]:
# Train the Model
learning_rate_decay = 0.95
min_learning_rate = 0.0005
display_step = 20 # Check training loss after every 20 batches
stop_early = 0 
stop = 3 # If the update loss does not decrease in 3 consecutive update checks, stop training
per_epoch = 3 # Make 3 update checks per epoch
update_check = (len(sorted_texts_short)//batch_size//per_epoch)-1

update_loss = 0 
batch_loss = 0
summary_update_loss = [] # Record the update losses for saving improvements in the model

checkpoint = "./best_model.ckpt" 
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

config = tf.ConfigProto(device_count = {'GPU': 1})
 

with tf.Session(graph=train_graph, config=config) as sess:
    sess.run(tf.global_variables_initializer())
    
    # If we want to continue training a previous session
    #loader = tf.train.import_meta_graph("./" + checkpoint + '.meta')
    #loader.restore(sess, checkpoint)
    
    for epoch_i in range(1, epochs+1):
        update_loss = 0
        batch_loss = 0
        for batch_i, (summaries_batch, texts_batch, summaries_lengths, texts_lengths) in enumerate(
                get_batches(sorted_summaries_short, sorted_texts_short, batch_size)):
            start_time = time.time()
            _, loss = sess.run(
                [train_op, cost],
                {input_data: texts_batch,
                 targets: summaries_batch,
                 lr: learning_rate,
                 summary_length: summaries_lengths,
                 text_length: texts_lengths,
                 keep_prob: keep_probability})

            batch_loss += loss
            update_loss += loss
            end_time = time.time()
            batch_time = end_time - start_time

            if batch_i % display_step == 0 and batch_i > 0:
                print('Epoch {:>3}/{} Batch {:>4}/{} - Loss: {:>6.3f}, Seconds: {:>4.2f}'
                      .format(epoch_i,
                              epochs, 
                              batch_i, 
                              len(sorted_texts_short) // batch_size, 
                              batch_loss / display_step, 
                              batch_time*display_step))
                batch_loss = 0

            if batch_i % update_check == 0 and batch_i > 0:
                print("Average loss for this update:", round(update_loss/update_check,3))
                summary_update_loss.append(update_loss)
                
                # If the update loss is at a new minimum, save the model
                if update_loss <= min(summary_update_loss):
                    print('New Record!') 
                    stop_early = 0
                    saver = tf.train.Saver() 
                    saver.save(sess, checkpoint)

                else:
                    print("No Improvement.")
                    stop_early += 1
                    if stop_early == stop:
                        break
                update_loss = 0
            
                    
        # Reduce learning rate, but not below its minimum value
        learning_rate *= learning_rate_decay
        if learning_rate < min_learning_rate:
            learning_rate = min_learning_rate
        
        if stop_early == stop:
            print("Stopping Training.")
            break

Epoch   1/100 Batch   20/1562 - Loss:  5.870, Seconds: 3.29
Epoch   1/100 Batch   40/1562 - Loss:  3.138, Seconds: 3.76
Epoch   1/100 Batch   60/1562 - Loss:  3.186, Seconds: 3.82
Epoch   1/100 Batch   80/1562 - Loss:  3.174, Seconds: 3.25
Epoch   1/100 Batch  100/1562 - Loss:  3.078, Seconds: 2.98
Epoch   1/100 Batch  120/1562 - Loss:  3.189, Seconds: 3.29
Epoch   1/100 Batch  140/1562 - Loss:  2.969, Seconds: 3.46
Epoch   1/100 Batch  160/1562 - Loss:  2.966, Seconds: 3.35
Epoch   1/100 Batch  180/1562 - Loss:  3.152, Seconds: 3.40
Epoch   1/100 Batch  200/1562 - Loss:  2.914, Seconds: 3.61
Epoch   1/100 Batch  220/1562 - Loss:  2.752, Seconds: 3.28
Epoch   1/100 Batch  240/1562 - Loss:  3.052, Seconds: 3.13
Epoch   1/100 Batch  260/1562 - Loss:  2.865, Seconds: 2.97
Epoch   1/100 Batch  280/1562 - Loss:  2.811, Seconds: 3.36
Epoch   1/100 Batch  300/1562 - Loss:  2.833, Seconds: 3.56
Epoch   1/100 Batch  320/1562 - Loss:  2.891, Seconds: 3.36
Epoch   1/100 Batch  340/1562 - Loss:  3

Epoch   2/100 Batch 1120/1562 - Loss:  2.011, Seconds: 3.26
Epoch   2/100 Batch 1140/1562 - Loss:  1.906, Seconds: 3.62
Epoch   2/100 Batch 1160/1562 - Loss:  2.209, Seconds: 3.92
Epoch   2/100 Batch 1180/1562 - Loss:  2.289, Seconds: 4.36
Epoch   2/100 Batch 1200/1562 - Loss:  2.407, Seconds: 3.67
Epoch   2/100 Batch 1220/1562 - Loss:  2.325, Seconds: 3.54
Epoch   2/100 Batch 1240/1562 - Loss:  2.097, Seconds: 3.71
Epoch   2/100 Batch 1260/1562 - Loss:  2.230, Seconds: 3.47
Epoch   2/100 Batch 1280/1562 - Loss:  2.239, Seconds: 3.46
Epoch   2/100 Batch 1300/1562 - Loss:  1.986, Seconds: 3.94
Epoch   2/100 Batch 1320/1562 - Loss:  2.101, Seconds: 3.52
Epoch   2/100 Batch 1340/1562 - Loss:  1.989, Seconds: 3.37
Epoch   2/100 Batch 1360/1562 - Loss:  1.981, Seconds: 3.88
Epoch   2/100 Batch 1380/1562 - Loss:  1.907, Seconds: 3.62
Epoch   2/100 Batch 1400/1562 - Loss:  1.926, Seconds: 3.83
Epoch   2/100 Batch 1420/1562 - Loss:  2.215, Seconds: 4.04
Epoch   2/100 Batch 1440/1562 - Loss:  2

Epoch   4/100 Batch  660/1562 - Loss:  1.901, Seconds: 3.46
Epoch   4/100 Batch  680/1562 - Loss:  1.913, Seconds: 4.05
Epoch   4/100 Batch  700/1562 - Loss:  1.645, Seconds: 3.96
Epoch   4/100 Batch  720/1562 - Loss:  1.874, Seconds: 3.05
Epoch   4/100 Batch  740/1562 - Loss:  1.763, Seconds: 3.58
Epoch   4/100 Batch  760/1562 - Loss:  1.731, Seconds: 3.75
Epoch   4/100 Batch  780/1562 - Loss:  1.518, Seconds: 3.61
Epoch   4/100 Batch  800/1562 - Loss:  1.708, Seconds: 3.62
Epoch   4/100 Batch  820/1562 - Loss:  1.772, Seconds: 3.80
Epoch   4/100 Batch  840/1562 - Loss:  1.635, Seconds: 3.66
Epoch   4/100 Batch  860/1562 - Loss:  1.732, Seconds: 3.77
Epoch   4/100 Batch  880/1562 - Loss:  1.832, Seconds: 3.38
Epoch   4/100 Batch  900/1562 - Loss:  1.960, Seconds: 3.66
Epoch   4/100 Batch  920/1562 - Loss:  1.821, Seconds: 3.76
Epoch   4/100 Batch  940/1562 - Loss:  1.966, Seconds: 3.37
Epoch   4/100 Batch  960/1562 - Loss:  1.725, Seconds: 3.33
Epoch   4/100 Batch  980/1562 - Loss:  1

Epoch   6/100 Batch  200/1562 - Loss:  1.530, Seconds: 3.51
Epoch   6/100 Batch  220/1562 - Loss:  1.345, Seconds: 3.32
Epoch   6/100 Batch  240/1562 - Loss:  1.627, Seconds: 3.12
Epoch   6/100 Batch  260/1562 - Loss:  1.591, Seconds: 3.40
Epoch   6/100 Batch  280/1562 - Loss:  1.482, Seconds: 3.53
Epoch   6/100 Batch  300/1562 - Loss:  1.488, Seconds: 3.36
Epoch   6/100 Batch  320/1562 - Loss:  1.667, Seconds: 3.45
Epoch   6/100 Batch  340/1562 - Loss:  1.824, Seconds: 3.03
Epoch   6/100 Batch  360/1562 - Loss:  1.732, Seconds: 3.89
Epoch   6/100 Batch  380/1562 - Loss:  1.612, Seconds: 3.11
Epoch   6/100 Batch  400/1562 - Loss:  1.675, Seconds: 3.34
Epoch   6/100 Batch  420/1562 - Loss:  1.675, Seconds: 3.38
Epoch   6/100 Batch  440/1562 - Loss:  1.515, Seconds: 3.54
Epoch   6/100 Batch  460/1562 - Loss:  1.636, Seconds: 3.67
Epoch   6/100 Batch  480/1562 - Loss:  1.501, Seconds: 3.40
Epoch   6/100 Batch  500/1562 - Loss:  1.335, Seconds: 3.58
('Average loss for this update:', 1.622)

Epoch   7/100 Batch 1300/1562 - Loss:  1.368, Seconds: 3.76
Epoch   7/100 Batch 1320/1562 - Loss:  1.355, Seconds: 3.31
Epoch   7/100 Batch 1340/1562 - Loss:  1.321, Seconds: 3.49
Epoch   7/100 Batch 1360/1562 - Loss:  1.372, Seconds: 3.18
Epoch   7/100 Batch 1380/1562 - Loss:  1.313, Seconds: 3.99
Epoch   7/100 Batch 1400/1562 - Loss:  1.346, Seconds: 4.09
Epoch   7/100 Batch 1420/1562 - Loss:  1.578, Seconds: 4.26
Epoch   7/100 Batch 1440/1562 - Loss:  1.700, Seconds: 3.85
Epoch   7/100 Batch 1460/1562 - Loss:  1.698, Seconds: 4.09
Epoch   7/100 Batch 1480/1562 - Loss:  1.619, Seconds: 3.76
Epoch   7/100 Batch 1500/1562 - Loss:  1.569, Seconds: 4.17
Epoch   7/100 Batch 1520/1562 - Loss:  1.457, Seconds: 4.00
Epoch   7/100 Batch 1540/1562 - Loss:  1.513, Seconds: 3.91
('Average loss for this update:', 1.472)
New Record!
Epoch   7/100 Batch 1560/1562 - Loss:  1.396, Seconds: 3.66
Epoch   8/100 Batch   20/1562 - Loss:  1.797, Seconds: 3.11
Epoch   8/100 Batch   40/1562 - Loss:  1.492, S

Epoch   9/100 Batch  840/1562 - Loss:  1.272, Seconds: 3.52
Epoch   9/100 Batch  860/1562 - Loss:  1.326, Seconds: 3.73
Epoch   9/100 Batch  880/1562 - Loss:  1.379, Seconds: 3.78
Epoch   9/100 Batch  900/1562 - Loss:  1.499, Seconds: 3.47
Epoch   9/100 Batch  920/1562 - Loss:  1.411, Seconds: 3.72
Epoch   9/100 Batch  940/1562 - Loss:  1.519, Seconds: 3.50
Epoch   9/100 Batch  960/1562 - Loss:  1.325, Seconds: 3.58
Epoch   9/100 Batch  980/1562 - Loss:  1.312, Seconds: 3.42
Epoch   9/100 Batch 1000/1562 - Loss:  1.374, Seconds: 3.38
Epoch   9/100 Batch 1020/1562 - Loss:  1.380, Seconds: 3.83
('Average loss for this update:', 1.355)
New Record!
Epoch   9/100 Batch 1040/1562 - Loss:  1.231, Seconds: 4.01
Epoch   9/100 Batch 1060/1562 - Loss:  1.187, Seconds: 3.76
Epoch   9/100 Batch 1080/1562 - Loss:  1.206, Seconds: 3.66
Epoch   9/100 Batch 1100/1562 - Loss:  1.330, Seconds: 2.96
Epoch   9/100 Batch 1120/1562 - Loss:  1.306, Seconds: 3.13
Epoch   9/100 Batch 1140/1562 - Loss:  1.180, S

Epoch  16/100 Batch  680/1562 - Loss:  1.144, Seconds: 3.97
Epoch  16/100 Batch  700/1562 - Loss:  1.014, Seconds: 3.52
Epoch  16/100 Batch  720/1562 - Loss:  1.099, Seconds: 3.41
Epoch  16/100 Batch  740/1562 - Loss:  1.084, Seconds: 3.84
Epoch  16/100 Batch  760/1562 - Loss:  1.050, Seconds: 3.43
Epoch  16/100 Batch  780/1562 - Loss:  0.858, Seconds: 3.53
Epoch  16/100 Batch  800/1562 - Loss:  1.012, Seconds: 3.38
Epoch  16/100 Batch  820/1562 - Loss:  1.029, Seconds: 3.81
Epoch  16/100 Batch  840/1562 - Loss:  0.976, Seconds: 3.54
Epoch  16/100 Batch  860/1562 - Loss:  1.008, Seconds: 3.60
Epoch  16/100 Batch  880/1562 - Loss:  1.018, Seconds: 3.62
Epoch  16/100 Batch  900/1562 - Loss:  1.127, Seconds: 3.66
Epoch  16/100 Batch  920/1562 - Loss:  1.076, Seconds: 3.58
Epoch  16/100 Batch  940/1562 - Loss:  1.181, Seconds: 3.71
Epoch  16/100 Batch  960/1562 - Loss:  1.019, Seconds: 3.24
Epoch  16/100 Batch  980/1562 - Loss:  1.016, Seconds: 3.60
Epoch  16/100 Batch 1000/1562 - Loss:  1

Epoch  18/100 Batch  220/1562 - Loss:  0.856, Seconds: 3.28
Epoch  18/100 Batch  240/1562 - Loss:  0.999, Seconds: 3.33
Epoch  18/100 Batch  260/1562 - Loss:  1.008, Seconds: 3.39
Epoch  18/100 Batch  280/1562 - Loss:  0.971, Seconds: 3.63
Epoch  18/100 Batch  300/1562 - Loss:  0.942, Seconds: 3.35
Epoch  18/100 Batch  320/1562 - Loss:  1.069, Seconds: 3.44
Epoch  18/100 Batch  340/1562 - Loss:  1.131, Seconds: 3.28
Epoch  18/100 Batch  360/1562 - Loss:  1.082, Seconds: 3.59
Epoch  18/100 Batch  380/1562 - Loss:  1.011, Seconds: 3.35
Epoch  18/100 Batch  400/1562 - Loss:  1.044, Seconds: 3.32
Epoch  18/100 Batch  420/1562 - Loss:  1.037, Seconds: 3.67
Epoch  18/100 Batch  440/1562 - Loss:  0.987, Seconds: 3.63
Epoch  18/100 Batch  460/1562 - Loss:  1.059, Seconds: 3.05
Epoch  18/100 Batch  480/1562 - Loss:  0.949, Seconds: 3.24
Epoch  18/100 Batch  500/1562 - Loss:  0.778, Seconds: 3.37
('Average loss for this update:', 1.03)
No Improvement.
Epoch  18/100 Batch  520/1562 - Loss:  0.998

Epoch  19/100 Batch 1320/1562 - Loss:  0.869, Seconds: 3.48
Epoch  19/100 Batch 1340/1562 - Loss:  0.865, Seconds: 3.86
Epoch  19/100 Batch 1360/1562 - Loss:  0.907, Seconds: 3.64
Epoch  19/100 Batch 1380/1562 - Loss:  0.862, Seconds: 3.42
Epoch  19/100 Batch 1400/1562 - Loss:  0.925, Seconds: 3.86
Epoch  19/100 Batch 1420/1562 - Loss:  1.031, Seconds: 3.88
Epoch  19/100 Batch 1440/1562 - Loss:  1.115, Seconds: 3.75
Epoch  19/100 Batch 1460/1562 - Loss:  1.162, Seconds: 3.76
Epoch  19/100 Batch 1480/1562 - Loss:  1.092, Seconds: 3.51
Epoch  19/100 Batch 1500/1562 - Loss:  1.043, Seconds: 3.64
Epoch  19/100 Batch 1520/1562 - Loss:  1.005, Seconds: 3.94
Epoch  19/100 Batch 1540/1562 - Loss:  1.043, Seconds: 3.88
('Average loss for this update:', 0.969)
No Improvement.
Epoch  19/100 Batch 1560/1562 - Loss:  0.929, Seconds: 3.86
Epoch  20/100 Batch   20/1562 - Loss:  1.260, Seconds: 3.28
Epoch  20/100 Batch   40/1562 - Loss:  1.004, Seconds: 3.17
Epoch  20/100 Batch   60/1562 - Loss:  1.00

Epoch  21/100 Batch  840/1562 - Loss:  0.826, Seconds: 3.39
Epoch  21/100 Batch  860/1562 - Loss:  0.847, Seconds: 3.28
Epoch  21/100 Batch  880/1562 - Loss:  0.877, Seconds: 3.54
Epoch  21/100 Batch  900/1562 - Loss:  0.935, Seconds: 3.49
Epoch  21/100 Batch  920/1562 - Loss:  0.926, Seconds: 3.79
Epoch  21/100 Batch  940/1562 - Loss:  1.015, Seconds: 3.70
Epoch  21/100 Batch  960/1562 - Loss:  0.858, Seconds: 3.56
Epoch  21/100 Batch  980/1562 - Loss:  0.839, Seconds: 3.71
Epoch  21/100 Batch 1000/1562 - Loss:  0.930, Seconds: 3.73
Epoch  21/100 Batch 1020/1562 - Loss:  0.956, Seconds: 3.74
('Average loss for this update:', 0.905)
New Record!
Epoch  21/100 Batch 1040/1562 - Loss:  0.851, Seconds: 3.56
Epoch  21/100 Batch 1060/1562 - Loss:  0.779, Seconds: 3.76
Epoch  21/100 Batch 1080/1562 - Loss:  0.829, Seconds: 3.62
Epoch  21/100 Batch 1100/1562 - Loss:  0.907, Seconds: 3.32
Epoch  21/100 Batch 1120/1562 - Loss:  0.909, Seconds: 3.67
Epoch  21/100 Batch 1140/1562 - Loss:  0.765, S

Epoch  23/100 Batch  360/1562 - Loss:  0.961, Seconds: 3.91
Epoch  23/100 Batch  380/1562 - Loss:  0.888, Seconds: 3.17
Epoch  23/100 Batch  400/1562 - Loss:  0.893, Seconds: 3.33
Epoch  23/100 Batch  420/1562 - Loss:  0.853, Seconds: 3.60
Epoch  23/100 Batch  440/1562 - Loss:  0.870, Seconds: 3.59
Epoch  23/100 Batch  460/1562 - Loss:  0.924, Seconds: 3.38
Epoch  23/100 Batch  480/1562 - Loss:  0.828, Seconds: 3.29
Epoch  23/100 Batch  500/1562 - Loss:  0.670, Seconds: 3.29
('Average loss for this update:', 0.899)
No Improvement.
Epoch  23/100 Batch  520/1562 - Loss:  0.840, Seconds: 3.39
Epoch  23/100 Batch  540/1562 - Loss:  0.833, Seconds: 3.54
Epoch  23/100 Batch  560/1562 - Loss:  0.873, Seconds: 3.74
Epoch  23/100 Batch  580/1562 - Loss:  0.818, Seconds: 3.72
Epoch  23/100 Batch  600/1562 - Loss:  0.901, Seconds: 3.45
Epoch  23/100 Batch  620/1562 - Loss:  1.008, Seconds: 3.57
Epoch  23/100 Batch  640/1562 - Loss:  0.986, Seconds: 3.30
Epoch  23/100 Batch  660/1562 - Loss:  0.98

Epoch  24/100 Batch 1440/1562 - Loss:  1.027, Seconds: 3.74
Epoch  24/100 Batch 1460/1562 - Loss:  1.032, Seconds: 3.41
Epoch  24/100 Batch 1480/1562 - Loss:  0.959, Seconds: 3.32
Epoch  24/100 Batch 1500/1562 - Loss:  0.952, Seconds: 4.07
Epoch  24/100 Batch 1520/1562 - Loss:  0.906, Seconds: 4.16
Epoch  24/100 Batch 1540/1562 - Loss:  0.951, Seconds: 4.02
('Average loss for this update:', 0.866)
No Improvement.
Epoch  24/100 Batch 1560/1562 - Loss:  0.825, Seconds: 3.87
Epoch  25/100 Batch   20/1562 - Loss:  1.137, Seconds: 3.22
Epoch  25/100 Batch   40/1562 - Loss:  0.906, Seconds: 3.50
Epoch  25/100 Batch   60/1562 - Loss:  0.909, Seconds: 3.51
Epoch  25/100 Batch   80/1562 - Loss:  0.941, Seconds: 3.24
Epoch  25/100 Batch  100/1562 - Loss:  0.870, Seconds: 3.46
Epoch  25/100 Batch  120/1562 - Loss:  0.896, Seconds: 3.64
Epoch  25/100 Batch  140/1562 - Loss:  0.843, Seconds: 3.30
Epoch  25/100 Batch  160/1562 - Loss:  0.868, Seconds: 3.42
Epoch  25/100 Batch  180/1562 - Loss:  0.86

## 5. Making Our Own Summaries

To see the quality of the summaries that this model can generate, you can either create your own review, or use a review from the dataset. You can set the length of the summary to a fixed value, or use a random value like I have here.

In [117]:
def text_to_seq(text):
    '''Prepare the text for the model'''
    
    text = clean_text(text)
    return [vocab_to_int.get(word, vocab_to_int['<UNK>']) for word in text.split()]


- **input_sentences**: a list of reviews strings we are going to summarize
- **generagte_summary_length**: a int or list, if a list must be same length as input_sentences


In [118]:
texts = [text_to_seq(input_sentence) for input_sentence in hotel_docs]


In [120]:
input_sentences=["The coffee tasted great and was at such a good price! I highly recommend this to everyone!",
               "love individual oatmeal cups found years ago sam quit selling sound big lots quit selling found target expensive buy individually trilled get entire case time go anywhere need water microwave spoon know quaker flavor packets"]

input_sentences = hotel_docs
generagte_summary_length =  [100] * len(input_sentences)
texts = [text_to_seq(input_sentence) for input_sentence in input_sentences]
checkpoint = "./best_model.ckpt"
print(len(generagte_summary_length))
if type(generagte_summary_length) is list:
    if len(input_sentences)!=len(generagte_summary_length):
        raise Exception("[Error] makeSummaries parameter generagte_summary_length must be same length as input_sentences or an integer")
    generagte_summary_length_list = generagte_summary_length
else:
    generagte_summary_length_list = [generagte_summary_length] * len(texts)
loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(checkpoint + '.meta')
    loader.restore(sess, checkpoint)
    input_data = loaded_graph.get_tensor_by_name('input:0')
    logits = loaded_graph.get_tensor_by_name('predictions:0')
    text_length = loaded_graph.get_tensor_by_name('text_length:0')
    summary_length = loaded_graph.get_tensor_by_name('summary_length:0')
    keep_prob = loaded_graph.get_tensor_by_name('keep_prob:0')
    #Multiply by batch_size to match the model's input parameters
    for i, text in enumerate(texts):
        generagte_summary_length = generagte_summary_length_list[i]
        answer_logits = sess.run(logits, {input_data: [text]*batch_size, 
                                          summary_length: [generagte_summary_length], #summary_length: [np.random.randint(5,8)], 
                                          text_length: [len(text)]*batch_size,
                                          keep_prob: 1.0})[0] 
        # Remove the padding from the summaries
        pad = vocab_to_int["<PAD>"] 
        print('- Review:\n\r {}'.format(input_sentences[i]))
        print('- Summary:\n\r {}\n\r\n\r'.format(" ".join([int_to_vocab[i] for i in answer_logits if i != pad])))

5423
INFO:tensorflow:Restoring parameters from ./best_model.ckpt
- Review:
 With a stay at Petpimarn Boutique Resort in Bangkok Chatuchak you will be within a 15 minute drive of Kasetsart University and IMPACT Arena This hotel is 9.7 miles 15.6 kilometers from Temple of the Emerald Buddha and 10 miles 16.2 kilometers from Wat Arun Make yourself at home in one of the 89 air conditioned rooms featuring refrigerators Complimentary wireless Internet access keeps you connected and digital programming is available for your entertainment Bathrooms have showers and complimentary toiletries Conveniences include desks and complimentary bottled water and housekeeping is provided daily Make use of convenient amenities which include complimentary wireless Internet access and tour ticket assistance At Petpimarn Boutique Resort enjoy a satisfying meal at the restaurant English breakfasts are available daily from 6 30 AM to 10 AM for a fee Featured amenities include dry cleaning laundry services a 24 

- Review:
 With a stay at Room for you in Bangkok Don Muang you will be 9 minutes by car from IMPACT Arena This hotel is 10 miles 16 kilometers from Chatuchak Weekend Market and 14.7 miles 23.6 kilometers from Temple of the Emerald Buddha Make yourself at home in one of the 18 air conditioned rooms featuring refrigerators and flat screen televisions Complimentary wireless Internet access keeps you connected and digital programming is available for your entertainment Bathrooms have showers and complimentary toiletries Conveniences include complimentary bottled water and housekeeping is provided daily Take in the views from a terrace and make use of amenities such as complimentary wireless Internet access Featured amenities include dry cleaning laundry services and luggage storage Free self parking is available onsite
- Summary:
 meh my daughter s favorite gift


- Review:
 Staying at Jumbotel Hotel is a good choice when you are visiting Lak Si The hotel has a very good location also nea

- Review:
 yellow hostel is a hostel in a good neighborhood which is located at Don Mueang Airport The hostel has a very good location also near the Don Mueang International Airport DMK which is only 1.61 kilometers away Not only well positioned but yellow hostel is also one of hostels near the following Donmuang Taharnargardbumrung School within 1.33 kilometers and Don Mueang International Airport DMK within 1.61 kilometers yellow hostel is highly recommended for backpackers who want to get an affordable stay yet comfortable at the same time For you travelers who wish to travel comfortably on a budget yellow hostel is the perfect place to stay that provides decent facilities as well as great services While traveling with friends can be a lot of fun traveling solo has its own perks As for the accommodation yellow hostel is suitable for you who value privacy during your stay WiFi is available within public areas of the property to help you to stay connected with family and friends yello

- Review:
 Staying at Big Smile Hostel is a good choice when you are visiting Don Mueang Airport The hostel has a very good location also near the Don Mueang International Airport DMK which is only 1.78 kilometers away This hostel is very easy to find since it is strategically positioned close to public facilities Not only located within easy reach of various places of interests for your adventure but staying at Big Smile Hostel will also give you a pleasant stay Big Smile Hostel is highly recommended for backpackers who want to get an affordable stay yet comfortable at the same time For you travelers who wish to travel comfortably on a budget Big Smile Hostel is the perfect place to stay that provides decent facilities as well as great services Spend quality time at Big Smile Hostel with your spouse Make it an unforgettable stay by enjoying all services and facilities that the hostel has to offer This hostel is the perfect choice for couples seeking a romantic getaway or a honeymoon r

- Review:
 Staying at Beekataa Hostel Donmueang is a good choice when you are visiting Don Mueang Airport The hostel has a very good location also near the Don Mueang International Airport DMK which is only 1.96 kilometers away This hostel is very easy to find since it is strategically positioned close to public facilities Beekataa Hostel Donmueang is highly recommended for backpackers who want to get an affordable stay yet comfortable at the same time For you travelers who wish to travel comfortably on a budget Beekataa Hostel Donmueang is the perfect place to stay that provides decent facilities as well as great services While traveling with friends can be a lot of fun traveling solo has its own perks As for the accommodation Beekataa Hostel Donmueang is suitable for you who value privacy during your stay Beekataa Hostel Donmueang is the right choice for you who are looking for affordable accommodation in Don Mueang Airport WiFi is available within public areas of the property to hel

- Review:
 note hotel was previously named D Well Residence D Well Residence Don Muang is a hotel in a good neighborhood which is located at Don Mueang Airport The hotel has a very good location also near the Don Mueang Intl Airport Airport which is only 1.94 kilometers away Not only well positioned but D Well Residence Don Muang is also one of hotels near the following Don Mueang Intl Airport within 1.94 kilometers and IT Square within 3.12 kilometers Not only located within easy reach of various places of interests for your adventure but staying at D Well Residence Don Muang will also give you a pleasant stay D Well Residence Don Muang is highly recommended for backpackers who want to get an affordable stay yet comfortable at the same time For you travelers who wish to travel comfortably on a budget D Well Residence Don Muang is the perfect place to stay that provides decent facilities as well as great services This hotel is the perfect choice for couples seeking a romantic getaway o

- Review:
 With a stay at TKMY DMK in Bangkok Don Muang you will be 12.2 miles 19.7 kilometers from Temple of the Emerald Buddha and 8.1 miles 13 kilometers from Chatuchak Weekend Market This guesthouse is 10.2 miles 16.5 kilometers from Vimanmek Palace and 11.2 miles 18 kilometers from Wat Saket Make yourself at home in one of the 4 air conditioned guestrooms Complimentary wireless Internet access is available to keep you connected Bathrooms have complimentary toiletries and hair dryers Conveniences include coffee tea makers and complimentary bottled water and housekeeping is provided on a limited basis Take in the views from a garden and make use of amenities such as complimentary wireless Internet access and a television in a common area Cooked to order breakfasts are available daily from 6 30 AM to 10 AM for a fee Featured amenities include laundry facilities microwave in a common area and refrigerator in a common area A roundtrip airport shuttle is provided for a surcharge during 

- Review:
 Donmuang Airport Modern Bangkok Hotel is located in area city Don Mueang Airport The hotel has a very good location also near the Don Mueang International Airport DMK which is only 1.55 kilometers away There are plenty of tourist attractions nearby such as Donmuang Taharnargardbumrung School within 0.94 kilometers and Don Mueang International Airport DMK within 1.55 kilometers Donmuang Airport Modern Bangkok Hotel is a hotel near Airport an ideal accommodation while waiting for your next flight Enjoy a satisfying place to rest during your transit When staying at a hotel the design and architecture are two important factors that can spoil your eyes With its unique setting Donmuang Airport Modern Bangkok Hotel provides a pleasant accommodation for your stay 24 hours front desk is available to serve you from check in to check out or any assistance you need Should you desire more do not hesitate to ask the front desk we are always ready to accommodate you WiFi is available withi

- Review:
 Staying at I Rich Residence is a good choice when you are visiting Don Mueang Airport The apartment has a very good location also near the Don Mueang International Airport DMK which is only 4.88 kilometers away This apartment is very easy to find since it is strategically positioned close to public facilities I Rich Residence is highly recommended for backpackers who want to get an affordable stay yet comfortable at the same time For you travelers who wish to travel comfortably on a budget I Rich Residence is the perfect place to stay that provides decent facilities as well as great services Spend quality time at I Rich Residence with your spouse Make it an unforgettable stay by enjoying all services and facilities that the apartment has to offer I Rich Residence is a hotel near Airport an ideal accommodation while waiting for your next flight Enjoy a satisfying place to rest during your transit From business event to corporate gathering I Rich Residence provides complete se

- Review:
 Staying at BCJ Residence is a good choice when you are visiting Lak Si The hotel has a very good location also near the Don Mueang International Airport DMK which is only 7.61 kilometers away The hotel is located only 6.79 kilometers away from Mo Chit BTS Station This hotel is very easy to find since it is strategically positioned close to public facilities Splendid service together with wide range of facilities provided will make you complain for nothing during your stay at BCJ Residence The hotel s fitness center is a must try during your stay here 24 hours front desk is available to serve you from check in to check out or any assistance you need Should you desire more do not hesitate to ask the front desk we are always ready to accommodate you BCJ Residence is a hotel with great comfort and excellent service according to most hotel s guests With all facilities offered BCJ Residence is the right place to stay
- Summary:
 makes the ultimate


- Review:
 Ebina House is locat

- Review:
 Staying at Don Muang Airport Hostel is a good choice when you are visiting Anusawari The hostel has a very good location also near the Don Mueang International Airport DMK which is only 3.12 kilometers away This hostel is very easy to find since it is strategically positioned close to public facilities Don Muang Airport Hostel is highly recommended for backpackers who want to get an affordable stay yet comfortable at the same time For you travelers who wish to travel comfortably on a budget Don Muang Airport Hostel is the perfect place to stay that provides decent facilities as well as great services Don Muang Airport Hostel is a hotel near Airport an ideal accommodation while waiting for your next flight Enjoy a satisfying place to rest during your transit While traveling with friends can be a lot of fun traveling solo has its own perks As for the accommodation Don Muang Airport Hostel is suitable for you who value privacy during your stay Splendid service together with wid

- Review:
 TK Palace Hotel is a hotel in a good neighborhood which is located at Lak Si The hotel has a very good location also near the Don Mueang International Airport DMK which is only 5.41 kilometers away The hotel is located only 9.99 kilometers away from Mo Chit BTS Station Not only well positioned but TK Palace Hotel is also one of hotels near the following Chaeng Watthana Government Complex within 1.08 kilometers and Rajpruek Golf Club within 1.72 kilometers Whether you are planning an event or other special occasions TK Palace Hotel is a great choice for you with a large and well equipped function room to suit your requirements TK Palace Hotel is a hotel near Airport an ideal accommodation while waiting for your next flight Enjoy a satisfying place to rest during your transit From business event to corporate gathering TK Palace Hotel provides complete services and facilities that you and your colleagues need Be ready to get the unforgettable stay experience by its exclusive se

- Review:
 description
- Summary:
 herbes receive herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes herbes


- Review:
 SC Park Hotel is a hotel in a good neighborhood which is located at Phlabphla The hotel is located only 5.44 kilometers away from Phrom Phong BTS Station Not only well positioned but SC Park Hotel is also one of hotels near the following Mall Ramkhamh

- Review:
 Staying at QG Resort is a good choice when you are visiting Suvarnabhumi Airport The resort has a very good location also near the Suvarnabhumi International Airport BKK which is only 5.41 kilometers away This resort is very easy to find since it is strategically positioned close to public facilities QG Resort is a hotel near Airport an ideal accommodation while waiting for your next flight Enjoy a satisfying place to rest during your transit When staying at a resort the design and architecture are two important factors that can spoil your eyes With its unique setting QG Resort provides a pleasant accommodation for your stay QG Resort is the smartest choice for you who are looking for affordable accommodation with outstanding service Savor your favorite dishes with special cuisines from QG Resort exclusively for you WiFi is available within public areas of the property to help you to stay connected with family and friends Staying at QG Resort will surely satisfy you with it

InvalidArgumentError: assertion failed: [All values in memory_sequence_length must greater than zero.] [Condition x > 0 did not hold element-wise:] [x (text_length:0) = ] [0 0 0...]
	 [[node decode/decoder/while/BasicDecoderStep/decoder/attention_wrapper/assert_positive_1/assert_less/Assert/Assert (defined at <ipython-input-120-aa75814a47dc>:18)  = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT32], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](decode/decoder/while/BasicDecoderStep/decoder/attention_wrapper/assert_positive_1/assert_less/All/_175, decode/decoder/while/BasicDecoderStep/decoder/attention_wrapper/assert_positive_1/assert_less/Assert/Assert/data_0, decode/decoder/while/BasicDecoderStep/decoder/attention_wrapper/assert_positive_1/assert_less/Assert/Assert/data_1, decode/decoder/while/BasicDecoderStep/decoder/attention_wrapper/assert_positive_1/assert_less/Assert/Assert/data_2, decode/decoder/while/BasicDecoderStep/decoder/attention_wrapper/assert_positive_1/assert_less/Less/Enter/_177)]]

Caused by op u'decode/decoder/while/BasicDecoderStep/decoder/attention_wrapper/assert_positive_1/assert_less/Assert/Assert', defined at:
  File "/usr/lib64/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/usr/lib/python2.7/site-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/usr/lib/python2.7/site-packages/ipykernel/kernelapp.py", line 499, in start
    self.io_loop.start()
  File "/usr/lib64/python2.7/site-packages/tornado/ioloop.py", line 1073, in start
    handler_func(fd_obj, events)
  File "/usr/lib64/python2.7/site-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
    self._handle_recv()
  File "/usr/lib64/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
    self._run_callback(callback, msg)
  File "/usr/lib64/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
    callback(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/usr/lib/python2.7/site-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell
    handler(stream, idents, msg)
  File "/usr/lib/python2.7/site-packages/ipykernel/kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "/usr/lib/python2.7/site-packages/ipykernel/ipkernel.py", line 208, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/usr/lib/python2.7/site-packages/ipykernel/zmqshell.py", line 537, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2714, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/usr/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2818, in run_ast_nodes
    if self.run_code(code, result):
  File "/usr/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2878, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-120-aa75814a47dc>", line 18, in <module>
    loader = tf.train.import_meta_graph(checkpoint + '.meta')
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1674, in import_meta_graph
    meta_graph_or_file, clear_devices, import_scope, **kwargs)[0]
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1696, in _import_meta_graph_with_return_elements
    **kwargs))
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/meta_graph.py", line 806, in import_scoped_meta_graph_with_return_elements
    return_elements=return_elements)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/importer.py", line 442, in import_graph_def
    _ProcessNewOps(graph)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/importer.py", line 234, in _ProcessNewOps
    for new_op in graph._add_new_tf_operations(compute_devices=False):  # pylint: disable=protected-access
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3440, in _add_new_tf_operations
    for c_op in c_api_util.new_tf_operations(self)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3299, in _create_op_from_tf_operation
    ret = Operation(c_op, self)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): assertion failed: [All values in memory_sequence_length must greater than zero.] [Condition x > 0 did not hold element-wise:] [x (text_length:0) = ] [0 0 0...]
	 [[node decode/decoder/while/BasicDecoderStep/decoder/attention_wrapper/assert_positive_1/assert_less/Assert/Assert (defined at <ipython-input-120-aa75814a47dc>:18)  = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT32], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](decode/decoder/while/BasicDecoderStep/decoder/attention_wrapper/assert_positive_1/assert_less/All/_175, decode/decoder/while/BasicDecoderStep/decoder/attention_wrapper/assert_positive_1/assert_less/Assert/Assert/data_0, decode/decoder/while/BasicDecoderStep/decoder/attention_wrapper/assert_positive_1/assert_less/Assert/Assert/data_1, decode/decoder/while/BasicDecoderStep/decoder/attention_wrapper/assert_positive_1/assert_less/Assert/Assert/data_2, decode/decoder/while/BasicDecoderStep/decoder/attention_wrapper/assert_positive_1/assert_less/Less/Enter/_177)]]


## Summary

I hope that you found this project to be rather interesting and informative. One of my main recommendations for working with this dataset and model is either use a GPU, a subset of the dataset, or plenty of time to train your model. As you might be able to expect, the model will not be able to make good predictions just by seeing many reviews, it needs so see the reviews many times to be able to understand the relationship between words and between descriptions & summaries. 

In short, I'm pleased with how well this model performs. After creating numerous reviews and checking those from the dataset, I can happily say that most of the generated summaries are appropriate, some of them are great, and some of them make mistakes. I'll try to improve this model and if it gets better, I'll update my GitHub.

Thanks for reading!