# Text Summarization of Amazon reviews

In this notebook I generate summaries of Amazon reviews with the Seq2Seq model from Summarizer.py.

The model works impressively well in the end!

In [0]:
import os

import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from collections import Counter

#import Summarizer
#import summarizer_data_utils
#import summarizer_model_utils

In [2]:
print(tf.__version__)

1.12.0-rc2


## Helpers (Google Connect, Summarizer)

### Google Connect

In [0]:
# Working Google Drive mount, based on the workaround in the thread below
#https://stackoverflow.com/questions/52385655/unable-to-locate-package-google-drive-ocamlfuse-suddenly-stopped-working


!apt-get install -y -qq software-properties-common python-software-properties module-init-tools
!wget https://launchpad.net/~alessandro-strada/+archive/ubuntu/google-drive-ocamlfuse-beta/+build/15331130/+files/google-drive-ocamlfuse_0.7.0-0ubuntu1_amd64.deb
!dpkg -i google-drive-ocamlfuse_0.7.0-0ubuntu1_amd64.deb
!apt-get install -f
!apt-get -y install -qq fuse
from google.colab import auth
auth.authenticate_user()
from oauth2client.client import GoogleCredentials
creds = GoogleCredentials.get_application_default()
import getpass
!google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
vcode = getpass.getpass()
!echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}

!mkdir -p drive
!google-drive-ocamlfuse drive

### Summarizer

#### summarizer_model_utils

In [0]:
import numpy as np
import tensorflow as tf
from nltk.translate.bleu_score import sentence_bleu


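def minibatches(inputs, targets, minibatch_size):
    """Batch generator. Yields (x_batch, y_batch) lists of minibatch_size samples.
    A final, incomplete batch is filled up with samples from the start of the data.
    """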
    x_batch, y_batch = [], []
    for inp, tgt in zip(inputs, targets):
        if len(x_batch) == minibatch_size and len(y_batch) == minibatch_size:
            yield x_batch, y_batch
            x_batch, y_batch = [], []
        x_batch.append(inp)
        y_batch.append(tgt)

    if len(x_batch) != 0:
        for inp, tgt in zip(inputs, targets):
            if len(x_batch) != minibatch_size:
                x_batch.append(inp)
                y_batch.append(tgt)
            else:
                break
        yield x_batch, y_batch


def pad_sequences(sequences, pad_tok, tail=True):
    """Pads the sentences, so that all sentences in a batch have the same length.
    """

    max_length = max(len(x) for x in sequences)

    sequence_padded, sequence_length = [], []

    for seq in sequences:
        seq = list(seq)
        if tail:
            seq_ = seq[:max_length] + [pad_tok] * max(max_length - len(seq), 0)
        else:
            seq_ = [pad_tok] * max(max_length - len(seq), 0) + seq[:max_length]

        sequence_padded += [seq_]
        sequence_length += [min(len(seq), max_length)]

    return sequence_padded, sequence_length


def sample_results(preds, ind2word, word2ind, converted_summaries, converted_texts, use_bleu=False):
    """Prints the actual text and summary next to the corresponding created summary.
    Handles the output of both the beam search and the greedy decoder.
    """
    beam = False

    if len(np.array(preds).shape) == 4:
        beam = True

    # The BLEU score is not used correctly here, but serves as a reference.
    if use_bleu:
        bleu_scores = []

    for pred, summary, text, seq_length in zip(preds[0],
                                               converted_summaries,
                                               converted_texts,
                                               [len(inds) for inds in converted_summaries]):
        print('\n\n\n', 100 * '-')
        if beam:
            actual_text = [ind2word[word] for word in text if
                           word != word2ind["<SOS>"] and word != word2ind["<EOS>"]]
            actual_summary = [ind2word[word] for word in summary if
                              word != word2ind['<EOS>'] and word != word2ind['<SOS>']]

            created_summary = []
            for word in pred:
                # word holds one id per beam; use the first beam
                if word[0] != word2ind['<SOS>'] and word[0] != word2ind['<EOS>']:
                    created_summary.append(ind2word[word[0]])

            print('Actual Text:\n{}\n'.format(' '.join(actual_text)))
            print('Actual Summary:\n{}\n'.format(' '.join(actual_summary)))
            print('Created Summary:\n{}\n'.format(' '.join(created_summary)))
            if use_bleu:
                bleu_score = sentence_bleu([actual_summary], created_summary)
                bleu_scores.append(bleu_score)
                print('Bleu-score:', bleu_score)

            print()


        else:
            actual_text = [ind2word[word] for word in text if
                           word != word2ind["<SOS>"] and word != word2ind["<EOS>"]]
            actual_summary = [ind2word[word] for word in summary if
                              word != word2ind['<EOS>'] and word != word2ind['<SOS>']]
            created_summary = [ind2word[word] for word in pred if
                               word != word2ind['<EOS>'] and word != word2ind['<SOS>']]

            print('Actual Text:\n{}\n'.format(' '.join(actual_text)))
            print('Actual Summary:\n{}\n'.format(' '.join(actual_summary)))
            print('Created Summary:\n{}\n'.format(' '.join(created_summary)))
            if use_bleu:
                bleu_score = sentence_bleu([actual_summary], created_summary)
                bleu_scores.append(bleu_score)
                print('Bleu-score:', bleu_score)

    if use_bleu:
        bleu_score = np.mean(bleu_scores)
        print('\n\n\nTotal Bleu Score:', bleu_score)


def reset_graph(seed=97):
    """Helper function to reset the default graph. This often
       comes in handy when using Jupyter notebooks.
    """
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)
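
The two batching helpers above are easiest to follow on a toy example. The sketch below re-declares condensed copies of `minibatches` and `pad_sequences` so that it runs on its own; the token ids and the pad id `0` are made up for illustration. It shows that the last, incomplete batch is topped up with samples from the start of the data, and that padding returns the true lengths alongside the padded rows.

```python
def minibatches(inputs, targets, minibatch_size):
    """Condensed copy of the helper above."""
    x_batch, y_batch = [], []
    for inp, tgt in zip(inputs, targets):
        if len(x_batch) == minibatch_size:
            yield x_batch, y_batch
            x_batch, y_batch = [], []
        x_batch.append(inp)
        y_batch.append(tgt)
    if x_batch:
        # Top up the final batch with samples from the start of the data.
        for inp, tgt in zip(inputs, targets):
            if len(x_batch) == minibatch_size:
                break
            x_batch.append(inp)
            y_batch.append(tgt)
        yield x_batch, y_batch


def pad_sequences(sequences, pad_tok, tail=True):
    """Condensed copy of the helper above."""
    max_length = max(len(s) for s in sequences)
    padded, lengths = [], []
    for seq in sequences:
        seq = list(seq)
        padding = [pad_tok] * (max_length - len(seq))
        padded.append(seq + padding if tail else padding + seq)
        lengths.append(len(seq))
    return padded, lengths


# Made-up token ids: five "texts" with their "summaries".
texts = [[5, 6, 7], [8, 9], [10], [11, 12, 13, 14], [15, 16]]
summaries = [[1], [2, 3], [4], [1, 2], [3]]

batches = list(minibatches(texts, summaries, minibatch_size=2))
print(len(batches))   # 3
print(batches[2][0])  # [[15, 16], [5, 6, 7]]  (second sample is reused from the start)

padded, lengths = pad_sequences(batches[2][0], pad_tok=0)
print(padded)   # [[15, 16, 0], [5, 6, 7]]
print(lengths)  # [2, 3]
```

The returned lengths are what the model later receives as sequence lengths, so the padded positions can be masked out.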

#### summarizer_data_utils

In [0]:
import os
import time
import re
import html
from collections import Counter

import nltk
import numpy as np


def preprocess_sentence(text, keep_most=False):
    """
    Helper function to remove html, unnecessary spaces and punctuation.
    Args:
        text: String.
        keep_most: Boolean. depending if True or False, we either
                   keep only letters and numbers or also other characters.
    Returns:
        processed text.
    """
    text = text.lower()
    text = fixup(text)
    text = re.sub(r"<br />", " ", text)
    if keep_most:
        text = re.sub(r"[^a-z0-9%!?.,:()/]", " ", text)
    else:
        text = re.sub(r"[^a-z0-9]", " ", text)
    text = re.sub(r" +", " ", text)  # collapse any run of spaces into a single space
    text = text.strip()
    return text


def fixup(x):
    re1 = re.compile(r'  +')
    x = x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
        'nbsp;', ' ').replace('#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace(
        '<br />', "\n").replace('\\"', '"').replace('<unk>', 'u_n').replace(' @.@ ', '.').replace(
        ' @-@ ', '-').replace('\\', ' \\ ')
    return re1.sub(' ', html.unescape(x))


def preprocess(text, keep_most=False):
    """
    Splits the text into sentences, preprocesses
       and tokenizes each sentence.
    Args:
        text: String. multiple sentences.
        keep_most: Boolean. depending if True or False, we either
                   keep only letters and numbers or also other characters.
    Returns:
        preprocessed and tokenized text.
    """
    tokenized = []
    for sentence in nltk.sent_tokenize(text):
        sentence = preprocess_sentence(sentence, keep_most)
        sentence = nltk.word_tokenize(sentence)
        for token in sentence:
            tokenized.append(token)

    return tokenized


def preprocess_texts_and_summaries(texts,
                                   summaries,
                                   keep_most=False):
    """Iterates over the given texts and summaries and tokenizes each of them
       with the preprocess() function.
       Also counts all the words in the texts and summaries.
       Returns: - processed texts
                - processed summaries
                - list of (word, count) tuples for all unique words,
                  sorted by count in descending order.
    """

    start_time = time.time()
    processed_texts = []
    processed_summaries = []
    words = []

    for text in texts:
        text = preprocess(text, keep_most)
        for word in text:
            words.append(word)
        processed_texts.append(text)
    for summary in summaries:
        summary = preprocess(summary, keep_most)
        for word in summary:
            words.append(word)

        processed_summaries.append(summary)
    words_counted = Counter(words).most_common()
    print('Processing Time: ', time.time() - start_time)

    return processed_texts, processed_summaries, words_counted


def create_word_inds_dicts(words_counted,
                           specials=None,
                           min_occurences=0):
    """ Creates lookup dicts from word to index and back.
        Returns the lookup dicts and a list of the words that were left out
        because they occur fewer than min_occurences times.
    """
    missing_words = []
    word2ind = {}
    ind2word = {}
    i = 0

    if specials is not None:
        for sp in specials:
            word2ind[sp] = i
            ind2word[i] = sp
            i += 1

    for (word, count) in words_counted:
        if count >= min_occurences:
            word2ind[word] = i
            ind2word[i] = word
            i += 1
        else:
            missing_words.append(word)

    return word2ind, ind2word, missing_words


def convert_sentence(review, word2ind):
    """ Converts the given tokens to their indices in word2ind.
        Unknown words are mapped to '<UNK>' and returned as a second list."""
    inds = []
    unknown_words = []

    for word in review:
        if word in word2ind.keys():
            inds.append(int(word2ind[word]))
        else:
            inds.append(int(word2ind['<UNK>']))
            unknown_words.append(word)

    return inds, unknown_words


def convert_to_inds(input, word2ind, eos=False, sos=False):
    converted_input = []
    all_unknown_words = set()

    for inp in input:
        converted_inp, unknown_words = convert_sentence(inp, word2ind)
        if eos:
            converted_inp.append(word2ind['<EOS>'])
        if sos:
            converted_inp.insert(0, word2ind['<SOS>'])
        converted_input.append(converted_inp)
        all_unknown_words.update(unknown_words)

    return converted_input, all_unknown_words


def convert_inds_to_text(inds, ind2word, preprocess=False):
    """ convert the given indexes back to text """
    words = [ind2word[word] for word in inds]
    return words


def load_pretrained_embeddings(path):
    """loads pretrained embeddings. stores each embedding in a
       dictionary with its corresponding word
    """
    embeddings = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split(' ')
            word = values[0]
            embedding_vector = np.array(values[1:], dtype='float32')
            embeddings[word] = embedding_vector
    return embeddings


def create_and_save_embedding_matrix(word2ind,
                                     pretrained_embeddings_path,
                                     save_path,
                                     embedding_dim=300):
    """Creates an embedding matrix with one row per word in word2ind. If a word is in
       pretrained_embeddings, its pretrained vector is used. Otherwise the row is
       initialized randomly.
    """
    pretrained_embeddings = load_pretrained_embeddings(pretrained_embeddings_path)
    embedding_matrix = np.zeros((len(word2ind), embedding_dim), dtype=np.float32)
    for word, i in word2ind.items():
        if word in pretrained_embeddings.keys():
            embedding_matrix[i] = pretrained_embeddings[word]
        else:
            embedding = np.array(np.random.uniform(-1.0, 1.0, embedding_dim))
            embedding_matrix[i] = embedding
    if not os.path.exists(os.path.dirname(save_path)):
        os.makedirs(os.path.dirname(save_path))
    np.save(save_path, embedding_matrix)
    return np.array(embedding_matrix)
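
To make the vocabulary helpers above more concrete, here is a minimal, self-contained sketch of the same idea: count words, assign indices after the special tokens, drop rare words, and encode a sentence with `<UNK>` as the fallback. The names `build_vocab` and `encode` are hypothetical stand-ins for `create_word_inds_dicts` and `convert_to_inds`, and the order of the special tokens is an assumption made for this example only.

```python
from collections import Counter

# Assumed order of the special tokens (illustrative only).
SPECIALS = ["<EOS>", "<SOS>", "<PAD>", "<UNK>"]


def build_vocab(tokenized_texts, specials, min_occurrences=0):
    """Hypothetical, condensed stand-in for create_word_inds_dicts()."""
    counts = Counter(tok for text in tokenized_texts for tok in text).most_common()
    # Special tokens get the lowest indices.
    word2ind = {sp: i for i, sp in enumerate(specials)}
    for word, count in counts:
        if count >= min_occurrences:
            word2ind[word] = len(word2ind)
    ind2word = {i: w for w, i in word2ind.items()}
    return word2ind, ind2word


def encode(tokens, word2ind, eos=False, sos=False):
    """Hypothetical stand-in for convert_to_inds() on a single sentence."""
    unk = word2ind["<UNK>"]
    ids = [word2ind.get(tok, unk) for tok in tokens]
    if eos:
        ids.append(word2ind["<EOS>"])
    if sos:
        ids.insert(0, word2ind["<SOS>"])
    return ids


corpus = [["great", "coffee", "great", "price"], ["bad", "coffee"]]
word2ind, ind2word = build_vocab(corpus, SPECIALS, min_occurrences=2)
print(word2ind)
# {'<EOS>': 0, '<SOS>': 1, '<PAD>': 2, '<UNK>': 3, 'great': 4, 'coffee': 5}

ids = encode(["great", "tea"], word2ind, eos=True, sos=True)
print(ids)                         # [1, 4, 3, 0]
print([ind2word[i] for i in ids])  # ['<SOS>', 'great', '<UNK>', '<EOS>']
```

Words below the frequency threshold ("price", "bad") and unseen words ("tea") all collapse to `<UNK>`, which is why a low `min_occurences` keeps more of the original wording at the cost of a larger vocabulary.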

#### Summarizer

In [0]:
import os
import numpy as np

import tensorflow as tf
from tensorflow.python.layers.core import Dense

#import summarizer_model_utils


class Summarizer:

    def __init__(self,
                 word2ind,
                 ind2word,
                 save_path,
                 mode='TRAIN',
                 num_layers_encoder=1,
                 num_layers_decoder=1,
                 embedding_dim=300,
                 rnn_size_encoder=256,
                 rnn_size_decoder=256,
                 learning_rate=0.001,
                 learning_rate_decay=0.9,
                 learning_rate_decay_steps=100,
                 max_lr=0.01,
                 keep_probability=0.8,
                 batch_size=64,
                 beam_width=10,
                 epochs=20,
                 eos="<EOS>",
                 sos="<SOS>",
                 pad='<PAD>',
                 clip=5,
                 inference_targets=False,
                 pretrained_embeddings_path=None,
                 summary_dir=None,
                 use_cyclic_lr=False):
        """
        Args:
            word2ind: lookup dict from word to index.
            ind2word: lookup dict from index to word.
            save_path: path to save the tf model to in the end.
            mode: String. 'TRAIN' or 'INFER'. depending on which mode we use
                  a different graph is created.
            num_layers_encoder: Integer. Number of encoder layers. defaults to 1.
            num_layers_decoder: Integer. Number of decoder layers. defaults to 1.
            embedding_dim: dimension of the embedding vectors in the embedding matrix.
                           every word has an embedding_dim 'long' vector.
            rnn_size_encoder: Integer. number of hidden units in encoder. defaults to 256.
            rnn_size_decoder: Integer. number of hidden units in decoder. defaults to 256.
            learning_rate: Float.
            learning_rate_decay: only if exponential learning rate is used.
            learning_rate_decay_steps: Integer.
            max_lr: only used if cyclic learning rate is used.
            keep_probability: Float.
            batch_size: Integer. Size of minibatches.
            beam_width: Integer. Only used in inference, for Beam Search.('INFER'-mode)
            epochs: Integer. Number of times the training is conducted
                    on the whole training data.
            eos: EndOfSentence tag.
            sos: StartOfSentence tag.
            pad: Padding tag.
            clip: Value to clip the gradients to in training process.
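            inference_targets: Boolean. If True, the number of decoding steps in 'INFER' mode
                               is capped at the longest target sequence length.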
            pretrained_embeddings_path: Path to pretrained embeddings. Has to be .npy
            summary_dir: Directory the summaries are written to for tensorboard.
            use_cyclic_lr: Boolean.
        """

        self.word2ind = word2ind
        self.ind2word = ind2word
        self.vocab_size = len(word2ind)
        self.num_layers_encoder = num_layers_encoder
        self.num_layers_decoder = num_layers_decoder
        self.rnn_size_encoder = rnn_size_encoder
        self.rnn_size_decoder = rnn_size_decoder
        self.save_path = save_path
        self.embedding_dim = embedding_dim
        self.mode = mode.upper()
        self.learning_rate = learning_rate
        self.learning_rate_decay = learning_rate_decay
        self.learning_rate_decay_steps = learning_rate_decay_steps
        self.keep_probability = keep_probability
        self.batch_size = batch_size
        self.beam_width = beam_width
        self.eos = eos
        self.sos = sos
        self.clip = clip
        self.pad = pad
        self.epochs = epochs
        self.inference_targets = inference_targets
        self.pretrained_embeddings_path = pretrained_embeddings_path
        self.use_cyclic_lr = use_cyclic_lr
        self.max_lr = max_lr
        self.summary_dir = summary_dir

    def build_graph(self):
        self.add_placeholders()
        self.add_embeddings()
        self.add_lookup_ops()
        self.initialize_session()
        self.add_seq2seq()
        self.saver = tf.train.Saver()
        print('Graph built.')

    def add_placeholders(self):
        self.ids_1 = tf.placeholder(tf.int32,
                                    shape=[None, None],
                                    name='ids_source')
        self.ids_2 = tf.placeholder(tf.int32,
                                    shape=[None, None],
                                    name='ids_target')
        self.sequence_lengths_1 = tf.placeholder(tf.int32,
                                                 shape=[None],
                                                 name='sequence_length_source')
        self.sequence_lengths_2 = tf.placeholder(tf.int32,
                                                 shape=[None],
                                                 name='sequence_length_target')
        self.maximum_iterations = tf.reduce_max(self.sequence_lengths_2,
                                                name='max_dec_len')

    def create_word_embedding(self, embed_name, vocab_size, embed_dim):
        """Creates embedding matrix in given shape - [vocab_size, embed_dim].
        """
        embedding = tf.get_variable(embed_name,
                                    shape=[vocab_size, embed_dim],
                                    dtype=tf.float32)
        return embedding

    def add_embeddings(self):
        """Creates the embedding matrix. In case path to pretrained embeddings is given,
           that embedding is loaded. Otherwise created.
        """
        if self.pretrained_embeddings_path is not None:
            self.embedding = tf.Variable(np.load(self.pretrained_embeddings_path),
                                         name='embedding')
            print('Loaded pretrained embeddings.')
        else:
            self.embedding = self.create_word_embedding('embedding',
                                                        self.vocab_size,
                                                        self.embedding_dim)

    def add_lookup_ops(self):
        """Additional lookup operation for both source embedding and target embedding matrix.
        """
        self.word_embeddings_1 = tf.nn.embedding_lookup(self.embedding,
                                                        self.ids_1,
                                                        name='word_embeddings_1')
        self.word_embeddings_2 = tf.nn.embedding_lookup(self.embedding,
                                                        self.ids_2,
                                                        name='word_embeddings_2')

    def make_rnn_cell(self, rnn_size, keep_probability):
        """Creates LSTM cell wrapped with dropout.
        """
        cell = tf.nn.rnn_cell.LSTMCell(rnn_size)
        cell = tf.nn.rnn_cell.DropoutWrapper(cell, input_keep_prob=keep_probability)
        return cell

    def make_attention_cell(self, dec_cell, rnn_size, enc_output, lengths, alignment_history=False):
        """Wraps the given cell with Bahdanau Attention.
        """
        attention_mechanism = tf.contrib.seq2seq.BahdanauAttention(num_units=rnn_size,
                                                                   memory=enc_output,
                                                                   memory_sequence_length=lengths,
                                                                   name='BahdanauAttention')

        return tf.contrib.seq2seq.AttentionWrapper(cell=dec_cell,
                                                   attention_mechanism=attention_mechanism,
                                                   attention_layer_size=None,
                                                   output_attention=False,
                                                   alignment_history=alignment_history)

    def triangular_lr(self, current_step):
        """Cyclic (triangular) learning rate whose amplitude decays exponentially with the step."""
        step_size = self.learning_rate_decay_steps
        base_lr = self.learning_rate
        max_lr = self.max_lr

        cycle = tf.floor(1 + current_step / (2 * step_size))
        x = tf.abs(current_step / step_size - 2 * cycle + 1)
        lr = base_lr + (max_lr - base_lr) * tf.maximum(0.0, tf.cast((1.0 - x), dtype=tf.float32)) * (0.99999 ** tf.cast(
            current_step,
            dtype=tf.float32))
        return lr


    def add_seq2seq(self):
        """Creates the sequence to sequence architecture."""
        with tf.variable_scope('dynamic_seq2seq', dtype=tf.float32):
            # Encoder
            encoder_outputs, encoder_state = self.build_encoder()

            # Decoder
            logits, sample_id, final_context_state = self.build_decoder(encoder_outputs,
                                                                        encoder_state)
            if self.mode == 'TRAIN':

                # Loss
                loss = self.compute_loss(logits)
                self.train_loss = loss
                self.eval_loss = loss
                self.global_step = tf.Variable(0, trainable=False)


                # cyclic learning rate
                if self.use_cyclic_lr:
                    self.learning_rate = self.triangular_lr(self.global_step)

                # exponential learning rate
                else:
                    self.learning_rate = tf.train.exponential_decay(
                        self.learning_rate,
                        self.global_step,
                        decay_steps=self.learning_rate_decay_steps,
                        decay_rate=self.learning_rate_decay,
                        staircase=True)

                # Optimizer
                opt = tf.train.AdamOptimizer(self.learning_rate)


                # Gradients
                if self.clip > 0:
                    grads, vs = zip(*opt.compute_gradients(self.train_loss))
                    grads, _ = tf.clip_by_global_norm(grads, self.clip)
                    self.train_op = opt.apply_gradients(zip(grads, vs),
                                                        global_step=self.global_step)
                else:
                    self.train_op = opt.minimize(self.train_loss,
                                                 global_step=self.global_step)



            elif self.mode == 'INFER':
                self.infer_logits = logits
                self.final_context_state = final_context_state
                self.sample_id = sample_id
                self.sample_words = self.sample_id

    def build_encoder(self):
        """The encoder. Bidirectional LSTM."""

        with tf.variable_scope("encoder"):
            fw_cell = self.make_rnn_cell(self.rnn_size_encoder // 2, self.keep_probability)
            bw_cell = self.make_rnn_cell(self.rnn_size_encoder // 2, self.keep_probability)

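            # NOTE: every iteration reads self.word_embeddings_1 again, so the layers are
            # not stacked; only the outputs of the last iteration are returned.
            for _ in range(self.num_layers_encoder):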
                (out_fw, out_bw), (state_fw, state_bw) = tf.nn.bidirectional_dynamic_rnn(
                    cell_fw=fw_cell,
                    cell_bw=bw_cell,
                    inputs=self.word_embeddings_1,
                    sequence_length=self.sequence_lengths_1,
                    dtype=tf.float32)
                encoder_outputs = tf.concat((out_fw, out_bw), -1)

            bi_state_c = tf.concat((state_fw.c, state_bw.c), -1)
            bi_state_h = tf.concat((state_fw.h, state_bw.h), -1)
            bi_lstm_state = tf.nn.rnn_cell.LSTMStateTuple(c=bi_state_c, h=bi_state_h)
            encoder_state = tuple([bi_lstm_state] * self.num_layers_encoder)

            return encoder_outputs, encoder_state


    def build_decoder(self, encoder_outputs, encoder_state):

        sos_id_2 = tf.cast(self.word2ind[self.sos], tf.int32)
        eos_id_2 = tf.cast(self.word2ind[self.eos], tf.int32)
        self.output_layer = Dense(self.vocab_size, name='output_projection')

        # Decoder.
        with tf.variable_scope("decoder") as decoder_scope:

            cell, decoder_initial_state = self.build_decoder_cell(
                encoder_outputs,
                encoder_state,
                self.sequence_lengths_1)

            # Train
            if self.mode != 'INFER':

                helper = tf.contrib.seq2seq.ScheduledEmbeddingTrainingHelper(
                    inputs=self.word_embeddings_2,
                    sequence_length=self.sequence_lengths_2,
                    embedding=self.embedding,
                    sampling_probability=0.5,
                    time_major=False)

                # Decoder
                my_decoder = tf.contrib.seq2seq.BasicDecoder(cell,
                                                             helper,
                                                             decoder_initial_state,
                                                             output_layer=self.output_layer)

                # Dynamic decoding
                outputs, final_context_state, _ = tf.contrib.seq2seq.dynamic_decode(
                    my_decoder,
                    output_time_major=False,
                    maximum_iterations=self.maximum_iterations,
                    swap_memory=False,
                    impute_finished=True,
                    scope=decoder_scope
                )

                sample_id = outputs.sample_id
                logits = outputs.rnn_output


            # Inference
            else:
                start_tokens = tf.fill([self.batch_size], sos_id_2)
                end_token = eos_id_2

                # beam search
                if self.beam_width > 0:
                    my_decoder = tf.contrib.seq2seq.BeamSearchDecoder(
                        cell=cell,
                        embedding=self.embedding,
                        start_tokens=start_tokens,
                        end_token=end_token,
                        initial_state=decoder_initial_state,
                        beam_width=self.beam_width,
                        output_layer=self.output_layer,
                    )

                # greedy
                else:
                    helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(self.embedding,
                                                                      start_tokens,
                                                                      end_token)

                    my_decoder = tf.contrib.seq2seq.BasicDecoder(cell,
                                                                 helper,
                                                                 decoder_initial_state,
                                                                 output_layer=self.output_layer)
                if self.inference_targets:
                    maximum_iterations = self.maximum_iterations
                else:
                    maximum_iterations = None

                # Dynamic decoding
                outputs, final_context_state, _ = tf.contrib.seq2seq.dynamic_decode(
                    my_decoder,
                    maximum_iterations=maximum_iterations,
                    output_time_major=False,
                    impute_finished=False,
                    swap_memory=False,
                    scope=decoder_scope)

                if self.beam_width > 0:
                    logits = tf.no_op()
                    sample_id = outputs.predicted_ids
                else:
                    logits = outputs.rnn_output
                    sample_id = outputs.sample_id

        return logits, sample_id, final_context_state

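    def build_decoder_cell(self, encoder_outputs, encoder_state,
                           sequence_lengths_1):
        """Builds the attention decoder cell. In inference mode with beam search,
           the encoder outputs, state and lengths are tiled beam_width times.
           The last encoder state is passed on as the initial decoder state.
        """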

        memory = encoder_outputs

        if self.mode == 'INFER' and self.beam_width > 0:
            memory = tf.contrib.seq2seq.tile_batch(memory,
                                                   multiplier=self.beam_width)
            encoder_state = tf.contrib.seq2seq.tile_batch(encoder_state,
                                                          multiplier=self.beam_width)
            sequence_lengths_1 = tf.contrib.seq2seq.tile_batch(sequence_lengths_1,
                                                               multiplier=self.beam_width)
            batch_size = self.batch_size * self.beam_width

        else:
            batch_size = self.batch_size

        # MY APPROACH
        if self.num_layers_decoder is not None:
            lstm_cell = tf.nn.rnn_cell.MultiRNNCell(
                [self.make_rnn_cell(self.rnn_size_decoder, self.keep_probability) for _ in
                 range(self.num_layers_decoder)])

        else:
            lstm_cell = self.make_rnn_cell(self.rnn_size_decoder, self.keep_probability)

        # attention cell
        cell = self.make_attention_cell(lstm_cell,
                                        self.rnn_size_decoder,
                                        memory,
                                        sequence_lengths_1)

        decoder_initial_state = cell.zero_state(batch_size, tf.float32).clone(cell_state=encoder_state)

        return cell, decoder_initial_state


    def compute_loss(self, logits):
        """Compute the loss during optimization."""
        target_output = self.ids_2
        max_time = self.maximum_iterations

        target_weights = tf.sequence_mask(self.sequence_lengths_2,
                                          max_time,
                                          dtype=tf.float32,
                                          name='mask')

        loss = tf.contrib.seq2seq.sequence_loss(logits=logits,
                                                targets=target_output,
                                                weights=target_weights,
                                                average_across_timesteps=True,
                                                average_across_batch=True, )
        return loss


    def train(self,
              inputs,
              targets,
              restore_path=None,
              validation_inputs=None,
              validation_targets=None):
        """Performs the training process. Runs training step in every epoch.
           Shuffles input data before every epoch.
           Optionally: - add tensorboard summaries.
                       - restoring previous model and retraining on top.
                       - evaluation step.
        """
        assert len(inputs) == len(targets)

        if self.summary_dir is not None:
            self.add_summary()

        self.initialize_session()
        if restore_path is not None:
            self.restore_session(restore_path)

        best_score = np.inf
        nepoch_no_imprv = 0

        inputs = np.array(inputs)
        targets = np.array(targets)

        for epoch in range(self.epochs + 1):
            print('-------------------- Epoch {} of {} --------------------'.format(epoch,
                                                                                    self.epochs))

            # shuffle the input data before every epoch.
            shuffle_indices = np.random.permutation(len(inputs))
            inputs = inputs[shuffle_indices]
            targets = targets[shuffle_indices]

            # run training epoch
            score = self.run_epoch(inputs, targets, epoch)

            # evaluate model
            if validation_inputs is not None and validation_targets is not None:
                self.run_evaluate(validation_inputs, validation_targets, epoch)


            #if not os.path.exists(self.save_path):
            #        os.makedirs(self.save_path)
            #    self.saver.save(self.sess, self.save_path)
                
            if score <= best_score:
                nepoch_no_imprv = 0
                if not os.path.exists(self.save_path):
                    os.makedirs(self.save_path)
                self.saver.save(self.sess, self.save_path)
                best_score = score
                print("--- new best score ---\n\n")
            else:
                # warm up epochs for the model
                if epoch > 10:
                    nepoch_no_imprv += 1
                # early stopping
                if nepoch_no_imprv >= 5:
                    print("- early stopping {} epochs without improvement".format(nepoch_no_imprv))
                    break

    def infer(self, inputs, restore_path, targets=None):
        """Runs inference process. No training takes place.
           Returns the predicted ids for every sentence.
        """
        self.initialize_session()
        self.restore_session(restore_path)

        prediction_ids = []
        if targets is not None:
            feed, _, sequence_lengths_2 = self.get_feed_dict(inputs, trgts=targets)
        else:
            feed, _ = self.get_feed_dict(inputs)

        infer_logits, s_ids = self.sess.run([self.infer_logits, self.sample_words], feed_dict=feed)
        prediction_ids.append(s_ids)

        # for (inps, trgts) in summarizer_model_utils.minibatches(inputs, targets, self.batch_size):
        #     feed, _, sequence_lengths= self.get_feed_dict(inps, trgts=trgts)
        #     infer_logits, s_ids = self.sess.run([self.infer_logits, self.sample_words], feed_dict = feed)
        #     prediction_ids.append(s_ids)

        return prediction_ids

    def run_epoch(self, inputs, targets, epoch):
        """Runs a single epoch.
           Returns the average loss value on the epoch."""
        batch_size = self.batch_size
        nbatches = (len(inputs) + batch_size - 1) // batch_size
        losses = []

        for i, (inps, trgts) in enumerate(minibatches(inputs, targets, batch_size)):
            if inps is not None and trgts is not None:
                fd, sl, s2 = self.get_feed_dict(inps,
                                                trgts=trgts)

                if i % 10 == 0 and self.summary_dir is not None:
                    _, train_loss, training_summ = self.sess.run([self.train_op,
                                                                  self.train_loss,
                                                                  self.training_summary],
                                                                 feed_dict=fd)
                    self.training_writer.add_summary(training_summ, epoch*nbatches + i)

                else:
                    _, train_loss = self.sess.run([self.train_op, self.train_loss],
                                                  feed_dict=fd)

                if i % 2 == 0 or i == (nbatches - 1):
                    print('Iteration: {} of {}\ttrain_loss: {:.4f}'.format(i, nbatches - 1, train_loss))
                losses.append(train_loss)

            else:
                print('Minibatch empty.')
                continue

        # average on the host; tf.reduce_mean here would add a new op to the graph every epoch.
        avg_loss = np.mean(losses)
        print('Average Score for this Epoch: {}'.format(avg_loss))

        return avg_loss

    def run_evaluate(self, inputs, targets, epoch):
        """Runs evaluation on validation inputs and targets.
        Optionally: - writes summary to Tensorboard.
        """
        eval_losses = []
        fd = None
        for inps, trgts in minibatches(inputs, targets, self.batch_size):
            fd, sl, s2 = self.get_feed_dict(inps, trgts)
            eval_loss = self.sess.run(self.eval_loss, feed_dict=fd)
            eval_losses.append(eval_loss)

        # average on the host instead of adding a new op to the graph on every call.
        avg_eval_loss = np.mean(eval_losses)
        print('Eval_loss: {}\n'.format(avg_eval_loss))

        if self.summary_dir is not None and fd is not None:
            # the summary op is evaluated on the last validation batch only.
            eval_summ = self.sess.run(self.eval_summary, feed_dict=fd)
            self.eval_writer.add_summary(eval_summ, epoch)



    def get_feed_dict(self, inps, trgts=None):
        """Creates the feed_dict that is fed into training or inference network.
           Pads inputs and targets.
           Returns (feed, sequence_lengths_1) without targets and
           (feed, sequence_lengths_1, sequence_lengths_2) with targets.
        """
        inp_ids, sequence_lengths_1 = pad_sequences(inps,
                                                    self.word2ind[self.pad],
                                                    tail=False)

        feed = {
            self.ids_1: inp_ids,
            self.sequence_lengths_1: sequence_lengths_1
        }

        if trgts is None:
            # without targets only the inputs are fed, in every mode.
            return feed, sequence_lengths_1

        trgt_ids, sequence_lengths_2 = pad_sequences(trgts,
                                                     self.word2ind[self.pad],
                                                     tail=True)
        # the target ids themselves are only fed outside of INFER mode.
        if self.mode != 'INFER':
            feed[self.ids_2] = trgt_ids
        feed[self.sequence_lengths_2] = sequence_lengths_2

        return feed, sequence_lengths_1, sequence_lengths_2

    def initialize_session(self):
        self.sess = tf.Session()
        self.sess.run(tf.global_variables_initializer())

    def restore_session(self, restore_path):
        self.saver.restore(self.sess, restore_path)
        print('Done.')

    def add_summary(self):
        """Summaries for Tensorboard."""
        self.training_summary = tf.summary.scalar('training_loss', self.train_loss)
        self.eval_summary = tf.summary.scalar('evaluation_loss', self.eval_loss)
        self.training_writer = tf.summary.FileWriter(self.summary_dir,
                                                     tf.get_default_graph())
        self.eval_writer = tf.summary.FileWriter(self.summary_dir)
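
The stopping rule inside `train()` can be read in isolation. The sketch below replays it on a plain list of per-epoch scores. The function name `early_stopping_epoch` and the example losses are made up for illustration; the warm-up of 10 epochs and the patience of 5 mirror the values hard-coded in `train()`.

```python
def early_stopping_epoch(scores, warmup=10, patience=5):
    """Returns the epoch index at which training would stop, or None.

    Mirrors the rule in train(): a score that is <= the best score so far
    resets the counter; otherwise the counter only grows after the warm-up.
    """
    best_score = float("inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(scores):
        if score <= best_score:
            best_score = score
            epochs_without_improvement = 0
        else:
            if epoch > warmup:
                epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# hypothetical losses: improve for 12 epochs, then stall at a worse value.
losses = [5.0 - 0.1 * e for e in range(12)] + [4.5] * 10
stop_epoch = early_stopping_epoch(losses)
print(stop_epoch)
```

Note that the score compared here is the training loss returned by `run_epoch`, not the validation loss, so the rule reacts to a stalling training loss rather than to overfitting.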

## The data


The data we will be using is a dataset from Kaggle, the Amazon Fine Food Reviews dataset.  
It contains, as the name suggests, roughly 570,000 reviews of fine foods from Amazon, together with a summary of each review. 
Our aim is to input a review (Text column) and automatically create a summary (Summary column) for it.


https://www.kaggle.com/snap/amazon-fine-food-reviews/data

### Reading and exploring

In [7]:
# load csv file using pandas.
file_path = "drive/Colab Notebooks/Menu/Data/Reviews.csv"
data = pd.read_csv(file_path)
data.shape

(568454, 10)

In [8]:
# we will only use the last two columns Summary (target) and Text (input).
data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [9]:
# check for missing values --> there are some in Summary.
# 27 summaries are missing, so we will drop those rows!
data.isnull().sum()

Id                         0
ProductId                  0
UserId                     0
ProfileName               16
HelpfulnessNumerator       0
HelpfulnessDenominator     0
Score                      0
Time                       0
Summary                   27
Text                       0
dtype: int64

In [0]:
# drop a row if its value in Summary is missing.
data.dropna(subset=['Summary'], inplace=True)

In [11]:
# only summary and text are useful for us.
data = data[['Summary', 'Text']]
data.head()

Unnamed: 0,Summary,Text
0,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,"""Delight"" says it all",This is a confection that has been around a fe...
3,Cough Medicine,If you are looking for the secret ingredient i...
4,Great taffy,Great taffy at a great price. There was a wid...


In [0]:
# we will not use all of them, only short reviews of similar length (101 to 149 characters).
# choosing texts of similar length makes it easier for the model to learn.
raw_texts = []
raw_summaries = []

for text, summary in zip(data.Text, data.Summary):
    if 100 < len(text) < 150:
        raw_texts.append(text)
        raw_summaries.append(summary)

In [13]:
len(raw_texts), len(raw_summaries)

(78862, 78862)

In [14]:
for t, s in zip(raw_texts[:5], raw_summaries[:5]):
    print('Text:\n', t)
    print('Summary:\n', s, '\n\n')

Text:
 Great taffy at a great price.  There was a wide assortment of yummy taffy.  Delivery was very quick.  If your a taffy lover, this is a deal.
Summary:
 Great taffy 


Text:
 This taffy is so good.  It is very soft and chewy.  The flavors are amazing.  I would definitely recommend you buying it.  Very satisfying!!
Summary:
 Wonderful, tasty taffy 


Text:
 Right now I'm mostly just sprouting this so my cats can eat the grass. They love it. I rotate it around with Wheatgrass and Rye too
Summary:
 Yay Barley 


Text:
 This is a very healthy dog food. Good for their digestion. Also good for small puppies. My dog eats her required amount at every feeding.
Summary:
 Healthy Dog Food 


Text:
 The Strawberry Twizzlers are my guilty pleasure - yummy. Six pounds will be around for a while with my son and I.
Summary:
 Strawberry Twizzlers - Yummy 




### Clean and prepare the data

In [17]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [18]:
# the function gives us the option to keep_most of the characters inside the texts and summaries, meaning
# punctuation, question marks, slashes...
# or we can set it to False, meaning we only want to keep letters and numbers like here.
processed_texts, processed_summaries, words_counted = preprocess_texts_and_summaries(
    raw_texts,
    raw_summaries,
    keep_most=False
)

Processing Time:  50.83246994018555


In [19]:
for t,s in zip(processed_texts[:5], processed_summaries[:5]):
    print('Text\n:', t, '\n')
    print('Summary:\n', s, '\n\n\n')

Text
: ['great', 'taffy', 'at', 'a', 'great', 'price', 'there', 'was', 'a', 'wide', 'assortment', 'of', 'yummy', 'taffy', 'delivery', 'was', 'very', 'quick', 'if', 'your', 'a', 'taffy', 'lover', 'this', 'is', 'a', 'deal'] 

Summary:
 ['great', 'taffy'] 



Text
: ['this', 'taffy', 'is', 'so', 'good', 'it', 'is', 'very', 'soft', 'and', 'chewy', 'the', 'flavors', 'are', 'amazing', 'i', 'would', 'definitely', 'recommend', 'you', 'buying', 'it', 'very', 'satisfying'] 

Summary:
 ['wonderful', 'tasty', 'taffy'] 



Text
: ['right', 'now', 'i', 'm', 'mostly', 'just', 'sprouting', 'this', 'so', 'my', 'cats', 'can', 'eat', 'the', 'grass', 'they', 'love', 'it', 'i', 'rotate', 'it', 'around', 'with', 'wheatgrass', 'and', 'rye', 'too'] 

Summary:
 ['yay', 'barley'] 



Text
: ['this', 'is', 'a', 'very', 'healthy', 'dog', 'food', 'good', 'for', 'their', 'digestion', 'also', 'good', 'for', 'small', 'puppies', 'my', 'dog', 'eats', 'her', 'required', 'amount', 'at', 'every', 'feeding'] 

Summary:
 ['

### Create lookup dicts

We cannot feed our network actual words, only numbers. So we first have to create our lookup dicts, where each word gets an int value (high or low, depending on its frequency in our corpus). These help us to later convert the texts into numbers.

We also add special tokens. EndOfSentence and StartOfSentence are crucial for the Seq2Seq model we use later.
The pad token is needed because all summaries and texts in a batch must have the same length; shorter ones are filled up with it.

So we need 2 lookup dicts:
 - from word to index 
 - from index to word. 
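
A minimal sketch of how two such lookup dicts could be built from word counts. This is not the notebook's `create_word_inds_dicts`, whose implementation is not shown here; `build_lookup_dicts` and the `_demo` names are hypothetical, and the choice to give special tokens the lowest indices is an assumption.

```python
from collections import Counter


def build_lookup_dicts(word_counts, specials):
    """Builds word->index and index->word dicts.

    Special tokens get the lowest indices; the remaining words follow,
    ordered from most to least frequent.
    """
    word2ind = {}
    for token in specials:
        word2ind[token] = len(word2ind)
    for word, _ in word_counts.most_common():
        if word not in word2ind:
            word2ind[word] = len(word2ind)
    ind2word = {ind: word for word, ind in word2ind.items()}
    return word2ind, ind2word


counts = Counter(['great', 'taffy', 'at', 'a', 'great', 'price'])
word2ind_demo, ind2word_demo = build_lookup_dicts(
    counts, ["<EOS>", "<SOS>", "<PAD>", "<UNK>"])
print(word2ind_demo['great'], ind2word_demo[4])
```

Because the second dict is derived directly from the first, the two always stay consistent: looking a word up and mapping the index back returns the same word.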

In [21]:
specials = ["<EOS>", "<SOS>", "<PAD>", "<UNK>"]
word2ind, ind2word, missing_words = create_word_inds_dicts(words_counted,
                                                           specials=specials)
print(len(word2ind), len(ind2word), len(missing_words))


25067 25067 0


### Pretrained embeddings

Optionally we can use pretrained word embeddings. These often improve training speed and accuracy.
Here I tried two different options: GloVe embeddings or embeddings from tf_hub.
The ones from tf_hub worked better, so we use those. 
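
Whichever source is used, the result is a matrix whose row i holds the vector for the word with index i. A rough sketch of that idea in plain Python, assuming contiguous indices starting at 0; `build_embedding_matrix`, the toy vocabulary and the toy vectors are made up for illustration and are not the notebook's `create_and_save_embedding_matrix`.

```python
import random


def build_embedding_matrix(word2ind, pretrained, dim, seed=0):
    """Returns (matrix, missing): row i holds the vector of the word with
    index i; words without a pretrained vector get small random values."""
    rng = random.Random(seed)
    matrix = [None] * len(word2ind)
    missing = []
    for word, ind in word2ind.items():
        if word in pretrained:
            matrix[ind] = list(pretrained[word])
        else:
            matrix[ind] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
            missing.append(word)
    return matrix, missing


# toy vocabulary and toy 3-dimensional vectors, made up for this example.
toy_word2ind = {"<PAD>": 0, "<UNK>": 1, "great": 2, "taffy": 3}
toy_vectors = {"great": [0.1, 0.2, 0.3], "taffy": [0.4, 0.5, 0.6]}
matrix, missing = build_embedding_matrix(toy_word2ind, toy_vectors, dim=3)
print(len(matrix), len(matrix[0]), missing)
```

Wrapping the result in `np.array(matrix)` gives an array of shape (vocabulary size, dim) that can be stored with `np.save`.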

In [0]:
# glove_embeddings_path = './glove.6B.300d.txt'
# embedding_matrix_save_path = './embeddings/my_embedding_github.npy'
# emb = summarizer_data_utils.create_and_save_embedding_matrix(word2ind,
#                                                              glove_embeddings_path,
#                                                              embedding_matrix_save_path)

In [29]:
# the embeddings from tf_hub. 
# embed = hub.Module("https://tfhub.dev/google/nnlm-en-dim128/1")
embed = hub.Module("https://tfhub.dev/google/Wiki-words-250/1")

# one embedding per vocabulary entry; this assumes the dict keys are
# in index order, so that row i belongs to the word with index i.
emb = embed(list(word2ind.keys()))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    embedding = sess.run(emb)

INFO:tensorflow:Saver not created because there are no variables in the graph to restore


In [30]:
embedding.shape

(25067, 250)

In [0]:
np.save('drive/Colab Notebooks/Modle 3/tf_hub_embedding.npy', embedding)

### Convert text and summaries

As mentioned before, we cannot feed the words directly to our network; we first have to convert them to numbers. This is what we do here. We also add the SOS and EOS tokens to the summaries.
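
A small sketch of the conversion step, followed by the padding that a batch needs later. `to_indices`, `pad_batch` and the toy vocabulary are hypothetical stand-ins; the notebook's own `convert_to_inds` and `pad_sequences` are not shown in this section and may behave differently, for example in where the padding is placed.

```python
def to_indices(tokens, word2ind, sos=False, eos=False):
    """Maps tokens to indices; unknown words fall back to <UNK>."""
    inds = [word2ind.get(tok, word2ind["<UNK>"]) for tok in tokens]
    if sos:
        inds = [word2ind["<SOS>"]] + inds
    if eos:
        inds = inds + [word2ind["<EOS>"]]
    return inds


def pad_batch(sequences, pad_id):
    """Pads every sequence at the end to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    lengths = [len(seq) for seq in sequences]
    return padded, lengths


toy_word2ind = {"<EOS>": 0, "<SOS>": 1, "<PAD>": 2, "<UNK>": 3,
                "great": 4, "taffy": 5, "yay": 6}
summaries = [["great", "taffy"], ["yay", "barley", "sprouts"]]
converted = [to_indices(s, toy_word2ind, sos=True, eos=True) for s in summaries]
padded, lengths = pad_batch(converted, toy_word2ind["<PAD>"])
print(padded, lengths)
```

The original lengths are returned alongside the padded batch so that the model can ignore the padded positions.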

In [0]:
# converts words in texts and summaries to indices
# it looks like we have to set eos here to False
converted_texts, unknown_words_in_texts = convert_to_inds(processed_texts,
                                                          word2ind,
                                                          eos=False)

In [0]:
converted_summaries, unknown_words_in_summaries = convert_to_inds(processed_summaries,
                                                                  word2ind,
                                                                  eos=True,
                                                                  sos=True)

In [37]:
converted_texts[0]

[12,
 1727,
 47,
 8,
 12,
 45,
 130,
 29,
 8,
 2728,
 1159,
 15,
 106,
 1727,
 322,
 29,
 25,
 249,
 62,
 101,
 8,
 1727,
 662,
 9,
 10,
 8,
 193]

In [38]:
# seems to have worked well. 
print( convert_inds_to_text(converted_texts[0], ind2word),
       convert_inds_to_text(converted_summaries[0], ind2word))


['great', 'taffy', 'at', 'a', 'great', 'price', 'there', 'was', 'a', 'wide', 'assortment', 'of', 'yummy', 'taffy', 'delivery', 'was', 'very', 'quick', 'if', 'your', 'a', 'taffy', 'lover', 'this', 'is', 'a', 'deal'] ['<SOS>', 'great', 'taffy', '<EOS>']


## The model

Now we can build and train our model. First we define the hyperparameters we want to use. Then we create our Summarizer and call .build_graph(), which, as the name suggests, builds the computation graph. 
Then we can train the model using .train().

After training we can try out our model using .infer().

### Training

We can optionally use a cyclic learning rate, which we do here. 
I trained the model for 20 epochs, at which point the loss was already low; training longer would probably give better results.

Unfortunately I do not have the resources to search for the best hyperparameters, but these do pretty well. 
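
For intuition, a cyclic learning rate moves back and forth between a lower and an upper bound instead of only decaying. The sketch below uses the common triangular policy; this is an assumption, since the schedule actually implemented inside the Summarizer is not shown here. `triangular_lr` is a hypothetical helper and `step_size=700` is chosen only for illustration, while the two bounds are the notebook's `learning_rate` and `max_lr`.

```python
import math


def triangular_lr(step, base_lr, max_lr, step_size):
    """Triangular cyclic learning rate: rises linearly from base_lr to
    max_lr over step_size steps, then falls back over the next step_size."""
    cycle = math.floor(1 + step / (2 * step_size))
    x = abs(step / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


# base_lr and max_lr are the notebook's learning_rate and max_lr;
# step_size=700 is an assumption made only for this illustration.
for step in (0, 350, 700, 1050, 1400):
    print(step, round(triangular_lr(step, 0.0005, 0.005, 700), 6))
```

With these values the rate starts at 0.0005, peaks at 0.005 after 700 steps and is back at 0.0005 after 1400 steps, after which the cycle repeats.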


In [0]:
# model hyperparameters
num_layers_encoder = 4
num_layers_decoder = 4
rnn_size_encoder = 512
rnn_size_decoder = 512

batch_size = 256
epochs = 200
clip = 5
keep_probability = 0.5
learning_rate = 0.0005
max_lr=0.005
learning_rate_decay_steps = 700
learning_rate_decay = 0.90


pretrained_embeddings_path = './tf_hub_embedding.npy'
summary_dir = os.path.join('./tensorboard', 'Nn_{}_Lr_{}'.format(rnn_size_encoder, learning_rate))


use_cyclic_lr = True
inference_targets=True


In [40]:
len(converted_summaries)

78862

In [41]:
round(78862*0.9)

70976

In [75]:
# build graph and train the model 
reset_graph()
summarizer = Summarizer(word2ind,
                                   ind2word,
                                   save_path='drive/Colab Notebooks/Model 3/my_model',
                                   mode='TRAIN',
                                   num_layers_encoder = num_layers_encoder,
                                   num_layers_decoder = num_layers_decoder,
                                   rnn_size_encoder = rnn_size_encoder,
                                   rnn_size_decoder = rnn_size_decoder,
                                   batch_size = batch_size,
                                   clip = clip,
                                   keep_probability = keep_probability,
                                   learning_rate = learning_rate,
                                   max_lr=max_lr,
                                   learning_rate_decay_steps = learning_rate_decay_steps,
                                   learning_rate_decay = learning_rate_decay,
                                   epochs = epochs,
                                   pretrained_embeddings_path = pretrained_embeddings_path,
                                   use_cyclic_lr = use_cyclic_lr,
                                   summary_dir = summary_dir)           

summarizer.build_graph()
summarizer.train(converted_texts[:70976], 
                 converted_summaries[:70976],
                 validation_inputs=converted_texts[70976:],
                 validation_targets=converted_summaries[70976:])


# hidden training output.
# both train and validation loss decrease nicely.

Loaded pretrained embeddings.


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Graph built.
-------------------- Epoch 0 of 200 --------------------
Iteration: 0 of 277	train_loss: 10.1294
Iteration: 2 of 277	train_loss: 10.1122
Iteration: 4 of 277	train_loss: 10.0395
Iteration: 6 of 277	train_loss: 9.6702
Iteration: 8 of 277	train_loss: 8.7834
Iteration: 10 of 277	train_loss: 7.7196
Iteration: 12 of 277	train_loss: 6.8165
Iteration: 14 of 277	train_loss: 6.0986
Iteration: 16 of 277	train_loss: 5.4761
Iteration: 18 of 277	train_loss: 5.0323
Iteration: 20 of 277	train_loss: 5.0313
Iteration: 22 of 277	train_loss: 4.8751
Iteration: 24 of 277	train_loss: 4.8908
Iteration: 26 of 277	train_loss: 4.7569
Iteration: 28 of 277	train_loss: 4.8678
Iteration: 30 of 277	train_loss: 4.9169
Iteration: 32 of 277	train_loss: 4.7730
Iteration: 34 of 277	train_loss: 4.8769
Iteration: 36 of 277	train_loss: 4.6928
Iteration: 38 of 277	train_loss: 4.7625
Iteration: 40 of 277	train_loss: 4.8740
Iteration: 42 of 277	train_loss: 4.6174
Iteration: 44 of 277	train_loss: 4.7879
Iteration: 4

KeyboardInterrupt: ignored

### Inference
Now we can use our trained model to create summaries. 

In [76]:
reset_graph()
summarizer = Summarizer(word2ind,
                                   ind2word,
                                   'drive/Colab Notebooks/Model 3/my_model',
                                   'INFER',
                                   num_layers_encoder = num_layers_encoder,
                                   num_layers_decoder = num_layers_decoder,
                                   batch_size = len(converted_texts[:50]),
                                   clip = clip,
                                   keep_probability = 1.0,
                                   learning_rate = 0.0,
                                   beam_width = 5,
                                   rnn_size_encoder = rnn_size_encoder,
                                   rnn_size_decoder = rnn_size_decoder,
                                   inference_targets = True,
                                   pretrained_embeddings_path = pretrained_embeddings_path)

summarizer.build_graph()
preds = summarizer.infer(converted_texts[:50],
                         restore_path = 'drive/Colab Notebooks/Model 3/my_model',
                         targets = converted_summaries[:50])


Loaded pretrained embeddings.
Graph built.
INFO:tensorflow:Restoring parameters from drive/Colab Notebooks/Model 3/my_model
Done.


In [77]:
# show results
sample_results(preds,
                                      ind2word,
                                      word2ind,
                                      converted_summaries[:50],
                                      converted_texts[:50])




 ----------------------------------------------------------------------------------------------------
Actual Text:
great taffy at a great price there was a wide assortment of yummy taffy delivery was very quick if your a taffy lover this is a deal

Actual Summary:
great taffy

Created Summary:
great taffy taffy





 ----------------------------------------------------------------------------------------------------
Actual Text:
this taffy is so good it is very soft and chewy the flavors are amazing i would definitely recommend you buying it very satisfying

Actual Summary:
wonderful tasty taffy

Created Summary:
delicious taffy taffy





 ----------------------------------------------------------------------------------------------------
Actual Text:
right now i m mostly just sprouting this so my cats can eat the grass they love it i rotate it around with wheatgrass and rye too

Actual Summary:
yay barley

Created Summary:
yay barley





 -----------------------------------------

# Conclusion

Generally I am really impressed by how well the model works. 
We only used a limited amount of data, trained for a limited amount of time and used nearly random hyperparameters, and it still delivers good results. 

However, we are clearly overfitting the training data and the model does not generalize perfectly.
Sometimes the summaries the model creates are good, sometimes bad, sometimes they are better than the original ones and sometimes they are just really funny.
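
The impression of "sometimes good, sometimes bad" could be put into numbers. A minimal sketch, assuming a simple unigram-overlap score in the spirit of ROUGE-1; this is not something the notebook computes, and `unigram_overlap` is a hypothetical helper. It is applied here to the first sampled result above.

```python
from collections import Counter


def unigram_overlap(reference, candidate):
    """Returns (precision, recall, f1) of the unigram overlap between
    two token lists, counting repeated words at most as often as they
    occur in the reference."""
    ref_counts = Counter(reference)
    cand_counts = Counter(candidate)
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / len(candidate) if candidate else 0.0
    recall = overlap / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# actual summary "great taffy" vs. created summary "great taffy taffy".
p, r, f = unigram_overlap("great taffy".split(), "great taffy taffy".split())
print(round(p, 3), round(r, 3), round(f, 3))
```

The repeated "taffy" lowers the precision while the recall stays perfect. Averaging such a score separately over training and validation examples would be one way to quantify the overfitting.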


Therefore it would be really interesting to scale it up and see how it performs. 

To sum up, I am impressed by seq2seq models. They perform great on many different tasks, and I look forward to exploring more possible applications (speech recognition, ...).