<a href="https://colab.research.google.com/github/GiuliaLanzillotta/exercises/blob/master/Tweet_Generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Implementation of **Generating sentences from a continuous space paper**
Here's a link to the [paper](https://arxiv.org/pdf/1511.06349v4.pdf).


> ### The goal 
 What I want to explore here is the expression of sentiment in generative models. <br>
 The dataset consists of two different samples of tweets, one with positive sentiment and one with negative sentiment. <br>
 The goal is to train two generators on the two sets separetly and analyse the qualitative differences.

# Data loading

In [1]:
!ls

drive  sample_data


In [2]:
!unzip 'drive/My Drive/twitter-datasets.zip'

Archive:  drive/My Drive/twitter-datasets.zip
  inflating: twitter-datasets/sample_submission.csv  
  inflating: twitter-datasets/test_data.txt  
  inflating: twitter-datasets/train_neg_full.txt  
  inflating: twitter-datasets/train_neg.txt  
  inflating: twitter-datasets/train_pos_full.txt  
  inflating: twitter-datasets/train_pos.txt  


In [0]:
positive_location = 'twitter-datasets/train_pos.txt'
negative_location = 'twitter-datasets/train_neg.txt'

An example of the raw data 

In [4]:
!head -3 'twitter-datasets/train_pos.txt'

<user> i dunno justin read my mention or not . only justin and god knows about that , but i hope you will follow me #believe 15
because your logic is so dumb , i won't even crop out your name or your photo . tsk . <url>
" <user> just put casper in a box ! " looved the battle ! #crakkbitch


In [5]:
!wc -l 'twitter-datasets/train_pos.txt'


100000 twitter-datasets/train_pos.txt


# Text preprocessing

In [0]:
VALIDATION_SPLIT = 0.2
MAX_SEQUENCE_LENGTH = 100
BATCH_SIZE = 1000

In [0]:
from nltk.tokenize.casual import TweetTokenizer
from collections import Counter
import pickle

In [0]:
def tokenize_text(text):
    """
    Transforms the specified files in tokens using the Twitter tokenizer.
    @:params: str
        Input text to tokenize
    @:return: list(str)
        Returns the tokens as a list of strings.
    """
    tokenizer = TweetTokenizer()
    # tokenizing the text
    tokens = tokenizer.tokenize(text)
    words = [w.lower() for w in tokens]
    return words

Now we build a vocabulary with the most frequent words

In [9]:
frequency_treshold = 100000*0.0001 # 0.1% of the tweets should contain the "frequent words"
frequency_treshold

10.0

In [0]:
def build_vocabulary(file_name, output_file_name):
  """
  Builds the vocabulary for the specified file.
  """
  words = []
  print("Reading ",file_name)
  raw = open(file_name,  "r").read()
  more_words = tokenize_text(raw)
  words.extend(more_words)
  # counting the words
  counter = Counter(words)
  words_count = dict(counter)
  # filtering 
  filtered_words = [k for k, v in words_count.items() if v >= frequency_treshold]
  # building voabulary 
  # index starts at 1 so that the 0 can be left for the empty spaces
  vocab = {k:i+1 for i,k in enumerate(filtered_words)}
  # saving the vocabulary 
  with open(output_file_name, 'wb') as f: 
      pickle.dump(vocab, f, pickle.HIGHEST_PROTOCOL)
  return vocab

In [11]:
vocab_pos = build_vocabulary(positive_location, "vocab_pos.pkl")
vocab_neg = build_vocabulary(negative_location, "vocab_neg.pkl")

Reading  twitter-datasets/train_pos.txt
Reading  twitter-datasets/train_neg.txt


In [12]:
len(vocab_pos.keys())

5711

In [13]:
len(vocab_neg.keys())

9641

In [0]:
def filter_sentence(sentence, vocab):
  """
  Filters the given sentence with the given vocabulary. 
    The words that are not in the vocabulary will be filtered out 
    of the sentence. 
  """
  return [word for word in sentence if word in vocab.keys()]

In [0]:
def sentence_to_sequence(sentence, vocab):
  """
  (Filtering included)
  Transforms the given sentence into a sequence of ints, 
  corresponding to the index of the word.
  """
  filtered_sentence = filter_sentence(sentence, vocab)
  return [vocab[word] for word in filtered_sentence]

# Sentences handler
During training we want to be able to load the tweets from the file sequentially for memory efficiency. <br>
Here I implement the helper functions that will do that


In [0]:
def load_new_chunk(start, chunksize, file):
  """
  Loads -chunksize- tweets from the specified file, starting from 
  -start- tweet (excluded). 
  """
  f=open(file)
  lines=f.readlines()
  selected = lines[start:start+chunksize]
  return selected 

In [17]:
load_new_chunk(99997, 5, positive_location)

["<user> <user> um gord ... i just read your profile . i'm not sure i can have lunch with a riders fan\n",
 "<user> i'm so excited for tomorrow ! look out for two leprechauns ! xx\n",
 'i always wondered what the job application is like at hooters .. do they just give you a bra and say , " here fill this out " .. ? "\n']

In [0]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [0]:
validation_size = int(VALIDATION_SPLIT*100000)

In [0]:
def input_generator(file, chunksize, max_seq_length, vocab):
  """
  The output of the generator must be a tuple (inputs, targets)
  This tuple (a single output of the generator) makes a single batch
  Different batches may have different sizes. For example, 
  the last batch of the epoch is commonly smaller than the others, 
  if the size of the dataset is not divisible by the batch size. 
  """
  start = 0 
  while start < 100000 - validation_size:
    sentences = load_new_chunk(start,chunksize,file)
    sequences = [sentence_to_sequence(sentence,vocab) for sentence in sentences]
    padded_sequences = pad_sequences(sequences, maxlen=max_seq_length, padding="pre", value=0)
    start +=chunksize
    yield padded_sequences, padded_sequences

In [0]:
def get_validation_data(file, max_seq_length, vocab):
  """
  """
  start = 100000 - validation_size
  sentences = load_new_chunk(start,validation_size,file)
  sequences = [sentence_to_sequence(sentence,vocab) for sentence in sentences]
  padded_sequences = pad_sequences(sequences, maxlen=max_seq_length, padding="pre", value=0)
  return padded_sequences

In [0]:
validation_pos = get_validation_data(positive_location,MAX_SEQUENCE_LENGTH,vocab_pos)
validation_neg = get_validation_data(negative_location,MAX_SEQUENCE_LENGTH,vocab_neg)

# Embeddings 

To embed the words we are going to use a pre-trained GloVe model as a *prior* on the embedding that we'll be inserted as a first layer to the final model.

In [24]:
!wget http://nlp.stanford.edu/data/glove.6B.zip

--2020-04-16 15:15:47--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2020-04-16 15:15:47--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2020-04-16 15:15:48--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2020-0

In [25]:
!unzip glove.6B.zip

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


We are going to use the *200d* embedding file 

In [0]:
glove_location = 'glove.6B.200d.txt'

In [0]:
import numpy as np

In [0]:
embeddings_index = {}
f = open(glove_location, encoding='utf8')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

In [0]:
def build_emb_matrix(glove, vocab):
  """
  Build the pre-trained GloVe embedding on the given vocabulary. 
  The GloVe embedding should be presented as a vocabulary. 
  """
  print("Building the GloVe embedding.")
  N = len(vocab.keys()) + 1 # number of words in the vocabulary 
  D = 200 # embedding dimension
  emb_matrix = np.zeros((N,D))
  for idx, word in enumerate(vocab.keys()):
    embedding_vector = glove.get(word)
    if embedding_vector is not None: emb_matrix[idx+1] = embedding_vector
    else: emb_matrix[idx+1] = embeddings_index.get('unk')
  return emb_matrix

In [30]:
emb_matrix_pos = build_emb_matrix(embeddings_index, vocab_pos)
emb_matrix_neg = build_emb_matrix(embeddings_index, vocab_neg)

Building the GloVe embedding.
Building the GloVe embedding.


In [31]:
emb_matrix_pos.shape

(5712, 200)

In [0]:
np.savez("emb_matrix_pos",emb_matrix_pos)
np.savez("emb_matrix_neg",emb_matrix_neg)

In [0]:
emb_matrix_pos = np.load("emb_matrix_pos.npz")["arr_0"]
emb_matrix_neg = np.load("emb_matrix_neg.npz")["arr_0"]

# VAE model
To make my life simpler I chose to use Keras to build the model. 

In [34]:
from keras.layers.advanced_activations import ELU

Using TensorFlow backend.


#### Hyperparameters first. 

In [0]:
LATENT_DIM = 64
HIDDEN_DIM = 96

#### text VAE class

In the following cells we'll build a **Model** class and its **Layers**' classes

In [0]:
import tensorflow as tf
from tensorflow.keras import layers

In [0]:
class Sampling(layers.Layer):
  """
  Sampling layer. 
  The sampling will be from the posterior on the latent vector, 
  whose prior is a standard Gaussian. 
  """
  def call(self, inputs):
    z_mean, z_log_var = inputs
    # batch and latent dimensionality
    batch_size = tf.shape(z_mean)[0] 
    latent_dim = tf.shape(z_mean)[1]
    epsilon = tf.keras.backend.random_normal(shape=(batch_size, latent_dim))
    return z_mean + tf.exp(0.5 * z_log_var) * epsilon


In [0]:
class Embedding(layers.Layer):
  """
  Embedding layer. 
  Responsible for training the embedding. 
  """
  def __init__(self,
               vocab_dim, # (number of words x embemdding dimension)
               seq_length, #  maximum seq_length
               emb_matrix= None, # pre-trained embeddings
               name='embedding',
               to_train= True, # wether to train the embeddings
               **kwargs):
    super(Embedding, self).__init__(name=name, **kwargs)
    N, D = vocab_dim
    self.embedding = layers.Embedding(N, D, weights=[emb_matrix],
                                      input_length=seq_length, trainable=to_train)

  def call(self, inputs):
    x = self.embedding(inputs)
    return x

In [0]:
class Encoder(layers.Layer):
  """
  Encoder layer. 
  This layer parametrizes the mapping from the input  
  to a latent space distribution. 
  Concretely it will map sentences to the triple (z_mean, z_log_var, z).
  """
  
  def __init__(self,
               vocab_dim, # (number of words x embemdding dimension)
               seq_length, #  maximum seq_length
               latent_dim=LATENT_DIM,
               hidden_dim=HIDDEN_DIM,
               name='encoder',
               dropout_rate=0.2,
               activation_fun='elu',
               **kwargs):
    super(Encoder, self).__init__(name=name, **kwargs)
    N, D = vocab_dim
    self.recurrent_layer = layers.Bidirectional(layers.LSTM(hidden_dim, 
                            return_sequences=False, recurrent_dropout=dropout_rate,
                            input_shape=(None,seq_length,D)), merge_mode='concat')
    self.dense_proj = layers.Dense(hidden_dim, activation=activation_fun)
    self.dense_mean = layers.Dense(latent_dim)
    self.dense_log_var = layers.Dense(latent_dim)
    self.sampling = Sampling()

  def call(self, inputs):
    # Input must be of shape (None, max_len)
    h = self.recurrent_layer(inputs)
    x = self.dense_proj(h)
    z_mean = self.dense_mean(x)
    z_log_var = self.dense_log_var(x)
    z = self.sampling((z_mean, z_log_var))
    return z_mean, z_log_var, z

In [0]:
class Decoder(layers.Layer):
  """
  Decoder layer. 
  This layer parametrizes the mapping from the latent space to the 
  output dimension. 
  Concretely it will map a z to a readable sentence.
  """
  def __init__(self,
               vocab_dim, # (number of words x embemdding dimension)
               seq_length, #  maximum seq_length
               hidden_dim=HIDDEN_DIM,
               latent_dim=LATENT_DIM,
               name='decoder',
               dropout_rate=0.2,
               **kwargs):
    super(Decoder, self).__init__(name=name, **kwargs)
    N, D = vocab_dim
    self.latent2hidden = layers.Dense(hidden_dim,activation='linear',
                                      input_shape=(None,latent_dim))
    self.recurrent_layer = layers.LSTM(hidden_dim, return_sequences=True, 
                                       recurrent_dropout=dropout_rate, 
                                       input_shape=(None,seq_length,D))
    self.mean = layers.TimeDistributed(layers.Dense(N, activation='linear'))

  def call(self, inputs):
    x = inputs[0]
    z = inputs[1]
    mask = inputs[2]
    hidden = self.latent2hidden(z)
    h_decoded = self.recurrent_layer(x, initial_state=[hidden,hidden], mask=mask)
    x_decoded_mean = self.mean(h_decoded)
    return h_decoded, x_decoded_mean

In [0]:
from tensorflow.keras import backend as K
import tensorflow_addons as tfa

In [0]:
class textVAE(tf.keras.Model):
  """
  Model class for VAE.
  Combines the embedding, encoder and decoder into an 
  end-to-end model for training."""

  def __init__(self,
               vocab_dim, # (number of words x embemdding dimension)
               seq_length, #  maximum seq_length
               unk_embedding, # embedding for unknown word
               emb_matrix=None, # pre-trained embeddings - if None, they will be trained
               train_emb = True,
               hidden_dim=HIDDEN_DIM,
               latent_dim=LATENT_DIM,
               word_dropout_rate=0.75,
               dropout_rate=0.2,
               name='textVAE',
               **kwargs):
    super(textVAE, self).__init__(name=name, **kwargs)
    self.N, self.D = vocab_dim
    self.unk_embedding = unk_embedding
    self.seq_length = seq_length
    self.word_dropout_rate = word_dropout_rate
    self.embedding = Embedding(vocab_dim=vocab_dim,
                               seq_length=seq_length,
                               emb_matrix=emb_matrix,
                               to_train=train_emb)
    self.encoder = Encoder(vocab_dim=vocab_dim,
                           seq_length=seq_length,
                           latent_dim=latent_dim,
                           hidden_dim=hidden_dim, 
                           dropout_rate=dropout_rate)
    self.decoder = Decoder(vocab_dim=vocab_dim, 
                           seq_length=seq_length,
                           hidden_dim=hidden_dim, 
                           latent_dim=latent_dim,
                           dropout_rate=dropout_rate)
    
  def get_mask(self, input_sequence):
    """
    Automatically drops self.word_dropout_rate% of words 
    in the input. (To call before forwarding to the decoder.)
    """
    #input shape = batch size x seq length x emb dimension
    shape = K.shape(input_sequence)
    prob = tf.random.uniform(shape,0,1,tf.float32)
    mask = prob < self.word_dropout_rate 
    return mask

  def compute_loss(self, z_mean, z_log_var, z, 
                   x, x_decoded_mean, target_weights):
    """
    Computes the VAE loss. 
    """
    labels = tf.cast(x, tf.int32)
    xent_loss = K.sum(tfa.seq2seq.sequence_loss(x_decoded_mean, 
                        labels, weights=target_weights, 
                        average_across_timesteps=False,
                        average_across_batch=False), axis=-1)
    kl_loss = - 0.5 * tf.reduce_mean(z_log_var - tf.square(z_mean) - tf.exp(z_log_var) + 1)
    return K.mean(xent_loss + kl_loss)


  def call(self, inputs):
    x = self.embedding(inputs)
    z_mean, z_log_var, z = self.encoder(x)
    # Word dropout for the decoder, as described in the paper
    mask = self.get_mask(inputs)
    h_decoded, x_decoded_mean = self.decoder((x,z,mask))
    # Add KL divergence regularization loss.
    target_weights=tf.ones_like(inputs, dtype=tf.float32)
    loss = self.compute_loss(z_mean, z_log_var, z, 
                             inputs, x_decoded_mean, target_weights) 
    self.add_loss(loss)
    return h_decoded,x_decoded_mean  

In [0]:
LEARNING_RATE = 1e-3

In [0]:
def construct_model(emb_matrix, name):
  """
  Constructing the text VAE model.
  """
  N,D = emb_matrix.shape
  vae = textVAE(vocab_dim=(N,D),
                seq_length=MAX_SEQUENCE_LENGTH,
                unk_embedding=embeddings_index.get('unk'), # from GloVe embedding index
                emb_matrix=emb_matrix,
                name=name)

  optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)

  vae.compile(optimizer=optimizer)
  return vae

In [0]:
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

In [0]:
generator_pos = construct_model(emb_matrix_pos, name="pos_vae")
generator_pos

### text VAE with functional APIs
Here I implement the same model as above, but using the Keras functional API

In [0]:
from tensorflow.keras.layers import Bidirectional, Dense, Embedding, Input, Lambda, LSTM, RepeatVector, TimeDistributed, Layer, Activation, Dropout
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import backend as K
from tensorflow.keras.models import Model

In [0]:
def get_embedding(vocab, emb_matrix, N, D):
  x = Input(batch_shape=(None, MAX_SEQUENCE_LENGTH))
  x_embedded = Embedding(N, D, weights=[emb_matrix], input_length=MAX_SEQUENCE_LENGTH, trainable=True)(x)
  return x, x_embedded


def sample(z_mean, z_log_var):
  epsilon = K.random_normal(shape=(BATCH_SIZE, LATENT_DIM))
  return z_mean + tf.exp(0.5 * z_log_var) * epsilon  

def get_encoding(x_embedded, dropout_rate):
  hidden = Bidirectional(LSTM(HIDDEN_DIM, 
                              return_sequences=False, 
                              recurrent_dropout=dropout_rate),
                          merge_mode='concat')(x_embedded)
  hidden_dropped = Dropout(dropout_rate)(hidden) # dropping some of the input and output units, as suggested in the paper
  hidden = Dense(HIDDEN_DIM, activation='relu')(hidden_dropped)
  hidden_dropped = Dropout(dropout_rate)(hidden)
  z_mean = Dense(LATENT_DIM)(hidden_dropped)
  z_log_var = Dense(LATENT_DIM)(hidden_dropped)
  z = sample(z_mean, z_log_var)
  return z_mean, z_log_var, z, hidden

def get_decoding(z, hidden, dropout_rate, N):
  repeated_context = RepeatVector(MAX_SEQUENCE_LENGTH)
  sequence_decoded = LSTM(HIDDEN_DIM, 
                          return_sequences=True, 
                          recurrent_dropout=dropout_rate)(repeated_context(z), 
                                                          initial_state=[hidden,hidden])
  sequence_decoded_mean = TimeDistributed(Dense(N, activation='linear'))(sequence_decoded)
  return sequence_decoded, sequence_decoded_mean


class VAEloss(Layer):
    def __init__(self, **kwargs):
        self.is_placeholder = True
        super(VAEloss, self).__init__(**kwargs)

    def get_vae_loss(self, z_mean, z_log_var, z, x, inputs, decoded_mean):
      labels = tf.cast(inputs, tf.int32)
      target_weights=tf.ones_like(inputs, dtype=tf.float32)
      xent_loss = K.sum(tfa.seq2seq.sequence_loss(decoded_mean, 
                                                  labels,
                                                  weights=target_weights,
                                                  average_across_timesteps=False,
                                                  average_across_batch=False), axis=-1)
      kl_loss = - 0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
      return K.mean(xent_loss + kl_loss)

    def call(self, inputs):
      # unpack inputs
      z_mean = inputs[0]
      z_log_var = inputs[1]
      z = inputs[2]
      x = inputs[3]
      inputs_ = inputs[4]
      decoded_mean = inputs[5]
      # add loss 
      loss = self.get_vae_loss(z_mean, z_log_var, z, x, inputs_, decoded_mean)
      self.add_loss(loss, inputs=inputs)
      
      return K.ones_like(x)



#### Positive tweets model

In [109]:
# Nota bene: nothing is really computed here, we are just connecting the separate 
# computational graphs that we built above.
N,D = emb_matrix_pos.shape
inputs, embedding = get_embedding(vocab_pos, emb_matrix_pos, N, D)
z_mean, z_log_var, latent, encoding = get_encoding(embedding, dropout_rate=0.2)
decoding, decoding_mean = get_decoding(latent, encoding, dropout_rate=0.2, N=N)
# Now we inster the loss into the graph 
loss_layer = VAEloss()([z_mean, z_log_var, latent, embedding, inputs, decoding_mean])



In [110]:
VAE_pos = Model(inputs=inputs, outputs=[loss_layer], name="positiveVAE") 
VAE_pos.compile(optimizer='adam')
VAE_pos.summary()

Model: "positiveVAE"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_16 (InputLayer)           [(None, 100)]        0                                            
__________________________________________________________________________________________________
embedding_9 (Embedding)         (None, 100, 200)     1142400     input_16[0][0]                   
__________________________________________________________________________________________________
bidirectional_9 (Bidirectional) (None, 192)          228096      embedding_9[0][0]                
__________________________________________________________________________________________________
dropout_18 (Dropout)            (None, 192)          0           bidirectional_9[0][0]            
________________________________________________________________________________________

#### Negative tweets model

In [0]:
# computational graphs that we built above.
N,D = emb_matrix_pos.shape
inputs, embedding = get_embedding(vocab_pos, emb_matrix_pos, N, D)
z_mean, z_log_var, latent, encoding = get_encoding(embedding, dropout_rate=0.2)
decoding, decoding_mean = get_decoding(latent, encoding, dropout_rate=0.2, N=N)
# Now we inster the loss into the graph 
loss_layer = VAEloss()([z_mean, z_log_var, latent, embedding, inputs, decoding_mean])

In [0]:
VAE_neg = Model(inputs, [loss]) 

# Training 

In [0]:
model_checkpoint_file_pos = "textVAE_pos_checkpt.h5"
model_checkpoint_file_neg = "textVAE_neg_checkpt.h5"

In [95]:
EPOCHS = 1
N_STEPS = int((1-VALIDATION_SPLIT)*100000/BATCH_SIZE) 
N_STEPS

80

In [0]:
def train_model(model, input_file_location, vocab, model_location, 
                epochs=EPOCHS, n_steps=N_STEPS, batch_size=BATCH_SIZE):
  """
  Trains the given model for the given number of epochs, 
  making -n_steps- in each epoch. 
  """
  K.clear_session()

  for epoch in range(EPOCHS):
    print('-------epoch: ',epoch,'--------')
    model.fit(input_generator(positive_location, chunksize=BATCH_SIZE,
                                      max_seq_length=MAX_SEQUENCE_LENGTH, vocab=vocab),
                        epochs=1, steps_per_epoch=N_STEPS, 
                        validation_data = (validation_pos,validation_pos))
  print("Saving the model")
  model.save_weights(model_location)
  return model

#### Positive model training

In [0]:
generator_pos_trained = train_model(VAE_pos, 
                                    positive_location, 
                                    vocab_pos, 
                                    model_checkpoint_file_pos)

Now we want to define a separate model for the Encoder and for the Decoder

In [126]:
#ENCODER
N,D = emb_matrix_pos.shape
_inputs, _embedding = get_embedding(vocab_pos, emb_matrix_pos, N, D)
_, _, _latent, _encoding = get_encoding(_embedding, dropout_rate=0.2)
pos_encoder = Model(inputs=_inputs, outputs=[_latent,_encoding])
#DECODER 
_latent = Input(shape=(LATENT_DIM,))
_encoding = Input(shape=(HIDDEN_DIM,))
_decoding, _decoding_mean = get_decoding(_latent, _encoding, dropout_rate=0.2, N=N)
_decoding_mean = (Activation('softmax'))(_decoding_mean)
pos_decoder = Model(inputs=[_latent, _encoding], outputs=_decoding_mean)



# Tests 
We are now going to use the trained model to do sentence interpolation, but we need a few helper functions to do so 

In [0]:
index2word_pos = {v:k for k,v in vocab_pos.items()}
index2word_neg = {v:k for k,v in vocab_neg.items()}

In [0]:
def predict_latent(encoder,inputs):
  """
  Uses the encoder part of the trained generator to go 
  from the input space to the latent space.
  """
  latent,encoding = encoder(inputs)
  return latent,encoding

def predict_sentence(decoder,inputs):
  """
  Uses the decoder part of the trained generator to 
  go from the latent space to a new sentence.
  Inputs must be the tuple (x,z)
  """
  latent = model.embedding(inputs[0])
  encoding = inputs[1]
  decoding_mean = decoder(latent,encoding)
  return x_decoded_mean

In [0]:
def print_sampled_sentence(sentence_original, sentence_vector, decoder, index2word, latent_dim=LATENT_DIM, max_len=MAX_SEQUENCE_LENGTH):
  """
  Uses the trained model
  @:param: sentence vector: latent space representation of a sentence
      (this should be the output of predict_latent_mean)

  """
  N = len(index2word.keys())+1
  sentence_vector = np.reshape(sentence_vector,[1,latent_dim]) # reshaping into 1 x latent space, where 1 is the batch size
  generated = tf.keras.activation.softmax(predict_sentence(decoder, (sentence_original, sentence_vector)))
  generated = np.reshape(generated,[max_len,N]) # reshaping into sequence length x vocabulary words 
  generated_indices = np.apply_along_axis(np.argmax, 1, generated)
  word_list = list(np.vectorize(index2word.get)(generated_indices))
  w_list = [w for w in word_list if w] # filtering out the words
  print(' '.join(w_list))

In [0]:
def shortest_homologies(point1,point2,n):
  """
  Discover n mid-way  points in the path between point1 and point2.
  The name of the functions is due to the fact that the points are in the 
  latent space.
  """
  dist_vec = point2 - point1
  sample = np.linspace(0, 1, n, endpoint = True)
  samples = []
  for s in sample:
      samples.append(point1 + s * dist_vec)
  return samples

In [0]:
def sentences_interpolation(sentence1, sentence2, n, 
                            vocab, encoder, decoder, index2word, 
                            max_len=MAX_SEQUENCE_LENGTH):
  """
  Interpolating between the two given sentences in n steps.
  """
  sequence1 = sentence_to_sequence(sentence1, vocab)
  sequence1 = pad_sequences(sequence1, maxlen=max_len, padding="pre", value=0)
  sequence2 = sentence_to_sequence(sentence2, vocab)
  sequence2 = pad_sequences(sequence2, maxlen=max_len, padding="pre", value=0)
  latent1 = predict_latent(encoder,sequence1)
  latent2 = predict_latent(encoder,sequence2)
  homologies = shortest_homologies(latent1, latent2, n)
  for latent_sentence in homologies:
    print_sampled_sentence(sentence1, latent_sentence, decoder, index2word)

In [0]:
sentence1 = ["I think machine learning is great"]
sentence2 = ["Do you want a new book"]

In [135]:
print("Positive sentence interpolation")
sentences_interpolation(sentence1,sentence2,10,
                        vocab_pos, pos_encoder, pos_decoder, index2word_pos)

Positive sentence interpolation


ValueError: ignored

In [0]:
print("Negative sentence interpolation")
sentences_interpolation(sentence1,sentence2,10,
                        vocab_neg,generator_neg, index2word_neg)