In [23]:
import csv
import tensorflow as tf
import string
import random
from numpy import argmax
import numpy as np
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.translate.bleu_score import corpus_bleu

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\brace\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


LOADING PRETRAINED WORD EMBEDDING

We load the pretrained GloVe word embeddings (glove.6B.100d.txt, which holds 100-dimensional vectors for 400,000 tokens). We also use this word list as a vocabulary filter: every word in the dataset that has no GloVe vector is removed.

In [118]:
embeddings_index = {}
# each line holds a word followed by the components of its vector
with open('glove.6B.100d.txt', encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.
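
To make the file format concrete, here is a minimal sketch of the same parsing step on an in-memory sample. The function name load_glove and the 3-dimensional vectors are made up for illustration; the real file has 100 components per word.

```python
import io

def load_glove(handle):
    """Parse GloVe-style lines ("word v1 v2 ...") into a dict of float lists."""
    index = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        index[parts[0]] = [float(x) for x in parts[1:]]
    return index

# Toy stand-in for glove.6B.100d.txt: the 3-dimensional vectors are made up.
sample = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
toy_index = load_glove(sample)
print(len(toy_index), toy_index["cat"])
```

A lookup such as toy_index.get("dog") returns None, which is the property we rely on later to drop words without an embedding.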


READING DATA FROM CSV FILE

We use a dataset of approximately 4000 articles with their summaries. Here we read the data from a CSV file and discard articles that are too short (fewer than 5 words, or with a summary of fewer than 3 words), articles that are too long (300 words or more, to ease the training process) and duplicates. We store the remaining pairs in a dictionary and then print the size of the dataset that we will use.

In [78]:
# reading the dataset from a csv file
with open('news_summary.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    # create a dictionary where we will store pairs (complete text/ summary)
    dataset = dict()
    # loop through every line in the csv file
    for row in csv_reader:
        # skip the first line (descriptions of fields), duplicates and texts that are too long (to ease the training process)
        if row[3] != 'read_more' and row[3] not in dataset and len(row[5].split()) < 300:
            # skip really short texts and summaries
            if len(row[5].split()) < 5 or len(row[4].split()) < 3:
                continue
            # store the pair [summary, full text] in the dictionary (the key is the url of the text)
            dataset[row[3]] = [row[4], row[5]]
print(len(dataset))

2366
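
The filtering rules can be tried on a toy CSV. This is only a sketch: the helper keep_row and the three-column layout (url, summary, text) are assumptions made for the example and do not match the column positions of news_summary.csv.

```python
import csv
import io

def keep_row(url, summary, text, seen, max_words=300, min_text=5, min_summary=3):
    """Return True when a row passes the duplicate and length filters."""
    n_text, n_summary = len(text.split()), len(summary.split())
    if url in seen or n_text >= max_words:
        return False
    return n_text >= min_text and n_summary >= min_summary

# Toy CSV with hypothetical columns: url, summary, text.
raw = io.StringIO(
    "url,summary,text\n"
    "u1,storm hits coast,a strong storm hit the coast last night\n"
    "u1,storm hits coast,a strong storm hit the coast last night\n"
    "u2,too short,tiny text\n"
)
reader = csv.reader(raw)
next(reader)  # skip the header row
pairs = {}
for url, summary, text in reader:
    if keep_row(url, summary, text, pairs):
        pairs[url] = [summary, text]
print(pairs)
```

Only the first row survives: the second is a duplicate url and the third is too short.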


CLEANING THE TEXTS

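We clean the articles and the related summaries: we make all the characters lowercase, remove punctuation, drop stopwords and drop every word that has no GloVe embedding. We then wrap each text in startseq and endseq tokens.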

In [79]:

def clean_descriptions(descriptions):
    # build the stopword set and the punctuation table once instead of once per word
    stop_words = set(stopwords.words('english'))
    table = str.maketrans('', '', string.punctuation)
    # loop through all keys in dictionary
    for key, text_list in descriptions.items():
        # loop through the full text and summary associated to a given key
        for i in range(len(text_list)):
            # transform current sentence to array of words
            text = text_list[i].split()
            # convert all words to lower case
            text = [word.lower() for word in text]
            # remove punctuation from each word
            text = [w.translate(table) for w in text]
            # remove all the stopwords
            text = [w for w in text if w not in stop_words]
            # remove all the words that have no glove embedding
            text = [w for w in text if embeddings_index.get(w) is not None]
            # store cleaned descriptions as string and add the startseq and endseq tokens
            text_list[i] = 'startseq ' + ' '.join(text) + ' endseq'
            
# clean all the full texts and summaries in place
clean_descriptions(dataset)
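
As a quick illustration of what the cleaning does to a single sentence, here is a self-contained sketch. The tiny stopword set and vocabulary are hypothetical stand-ins for the NLTK stopword list and the GloVe vocabulary, and clean_text is a name introduced only for this example.

```python
import string

# Hypothetical stand-ins for the NLTK stopword list and the GloVe vocabulary.
TOY_STOPWORDS = {"the", "a", "is", "on"}
TOY_VOCAB = {"cat", "sat", "mat", "the", "on"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text(text, stop_words=TOY_STOPWORDS, vocab=TOY_VOCAB):
    """Lowercase, strip punctuation, drop stopwords and unknown words, add markers."""
    words = [w.lower().translate(PUNCT_TABLE) for w in text.split()]
    words = [w for w in words if w not in stop_words and w in vocab]
    return "startseq " + " ".join(words) + " endseq"

cleaned = clean_text("The cat, sat on the Mat! Xyzzy.")
print(cleaned)
```

The stopwords and the unknown word "xyzzy" disappear, and the markers delimit the sequence for the decoder.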

CREATE TOKENIZER AND FIND SIZES USEFUL FOR DEFINING THE MODEL

Here we fit a tokenizer to our dataset and then use it to find the size of our vocabulary (the number of unique words in the dataset plus one for the padding index 0), which determines the size of the softmax output layer of our model. We also find the maximum length of the input texts and of the summaries; these values are also used for defining the model.

In [80]:
# get a list from dictionary of data to be fed into tokenizer
listFromDictTot = list()
# append all full texts and summaries to the list
for key in dataset.keys():
    listFromDictTot.extend(dataset[key])
# tokenize all words (the machine learning model is going to need numbers as input)
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(listFromDictTot)
# get the size of the vocabulary created with our dataset (unique words in dataset)
vocabulary_size = len(tokenizer.word_index) + 1
print("The complete vocabulary size is " + str(vocabulary_size))
# get the maximum length of a full text and of a summary in the dataset (size needed when creating model to train)
max_summary_length = max(len(listFromDictTot[i].split()) for i in range(0,len(listFromDictTot),2))
max_length = max(len(listFromDictTot[i].split()) for i in range(1,len(listFromDictTot),2))
print('Maximum full text Length: %d' % max_length)
print('Maximum summary Length: %d' % max_summary_length)

The complete vocabulary size is 24255
Maximum full text Length: 194
Maximum summary Length: 47
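
The following sketch shows roughly what the tokenizer gives us, under the assumption that indices are assigned by descending word frequency starting from 1. The helper build_word_index is hypothetical and is not the Keras implementation; it only illustrates why we add 1 to the vocabulary size.

```python
from collections import Counter

def build_word_index(texts):
    """Assign indices 1..N by descending frequency; 0 is left free for padding."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() keeps first-seen order among words with equal counts
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

texts = ["startseq cat sat endseq", "startseq cat endseq"]
word_index = build_word_index(texts)
toy_vocabulary_size = len(word_index) + 1   # +1 for the padding index 0
sequences = [[word_index[w] for w in t.split()] for t in texts]
print(word_index, toy_vocabulary_size, sequences)
```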


TRAIN/TEST SPLIT

Here we split the dataset into training and test data (we create two dictionaries containing this data). At the end of the block we also print the sizes of the training and test sets to make sure that everything worked correctly.

In [81]:
# get the keys
keys = list(dataset.keys())
# reorder the keys randomly
random.shuffle(keys)
# set the percentage of data that will be used for training
trainPercentage = 0.8
# get the last index for which the data will be assigned to the training set
trainLastIndex = int(trainPercentage * len(keys))
trainDataset = dict()
# fill training dictionary
for i in range(trainLastIndex):
    trainDataset[keys[i]] = dataset[keys[i]]
testDataset = dict()
# fill test dictionary
for i in range(trainLastIndex, len(keys)):
    testDataset[keys[i]] = dataset[keys[i]]
# print size of the train and test data to make sure that everything worked well
print("The length of the training dataset is " + str(len(trainDataset)))
print("The length of the test dataset is " + str(len(testDataset))) 

The length of the training dataset is 1892
The length of the test dataset is 474
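
Because random.shuffle is not seeded above, the split changes on every run. A reproducible variant could look like the sketch below; the function split_keys and the seed value 42 are assumptions for illustration, not part of the notebook.

```python
import random

def split_keys(keys, train_fraction=0.8, seed=42):
    """Shuffle a copy of the keys with a fixed seed and cut it at train_fraction."""
    shuffled = list(keys)
    random.Random(seed).shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

toy_keys = ["url%d" % i for i in range(10)]
train_keys, test_keys = split_keys(toy_keys)
print(len(train_keys), len(test_keys))
```

Using a private random.Random instance keeps the split stable without touching the global random state.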


TRANSFORM THE DATA IN A BETTER FORMAT

Here we transform the dictionary containing the training data into two lists: one contains the summaries and the other the complete texts. Each text is converted to a sequence of tokens and padded to the corresponding maximum length before being inserted into its list.

In [82]:
inputTextDataset = list()
targetSummaryDataset = list()
# append all full texts and summaries to the list
for key in trainDataset.keys():
        inputTextDataset.append(tf.keras.preprocessing.sequence.pad_sequences([tokenizer.texts_to_sequences([trainDataset[key][1]])[0]], maxlen = max_length, padding='post'))
        targetSummaryDataset.append(tf.keras.preprocessing.sequence.pad_sequences([tokenizer.texts_to_sequences([trainDataset[key][0]])[0]], maxlen = max_summary_length, padding='post'))
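
To show what the padding step produces, here is a small pure-Python sketch. The helper pad_post is hypothetical; it mimics padding='post' together with the default truncating='pre' of pad_sequences, assuming a positive maxlen.

```python
def pad_post(sequence, maxlen, value=0):
    """Pad at the end with `value`; drop leading tokens if the sequence is too long."""
    trimmed = sequence[-maxlen:]          # mirrors the default truncating='pre'
    return trimmed + [value] * (maxlen - len(trimmed))

print(pad_post([5, 8, 2], 6))
print(pad_post([1, 2, 3, 4, 5, 6, 7], 6))
```

Since max_length and max_summary_length are computed over the whole dataset, no sequence is actually truncated in our case; only the zero padding at the end applies.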

DEFINE GAN MODEL

Here we define a GAN in which the generator is a bidirectional LSTM that takes a full text as input and produces a summary. The generated summaries are then fed, together with real summaries, to the discriminator (a CNN, which is usually considered a good model for text classification), whose task is to distinguish artificial summaries from real ones. Unfortunately we did not succeed in stabilizing the training and therefore did not obtain good results, mostly because of the limited time and computing resources available. We tried several techniques, including adding a minibatch standard deviation layer at the end of the discriminator, using batch normalization and replacing ReLU with LeakyReLU. We believe that the discriminator trains much faster than the generator, which leaves the generator without a useful learning signal and makes it stop learning. Given these problems in stabilizing the adversarial training, we opted for an encoder-decoder architecture, which is defined in the next section.

In [124]:
latent_dim = 200
embedding_dim=110
batch_size = 16

# mini-batch standard deviation layer
class MinibatchStdev(tf.keras.layers.Layer):
    # initialize layer
    def __init__(self, **kwargs):
        super(MinibatchStdev, self).__init__(**kwargs)
    # perform the operation
    def call(self, inputs):
        # size of group of which standard deviation is to be computed
        group_size = tf.shape(inputs)[0]
        shape = list(tf.keras.backend.int_shape(inputs))
        shape[0] = tf.shape(inputs)[0]
        minibatch = tf.keras.backend.reshape(inputs,(group_size, -1, shape[1], shape[2]))
        # subtract the mean from every element
        minibatch -= tf.reduce_mean(minibatch, axis=0, keepdims=True)
        # compute the mean of the squares of the previous results (the variance)
        minibatch = tf.reduce_mean(tf.keras.backend.square(minibatch), axis = 0)
        # get sqrt of the previous results (a small epsilon is added for numerical stability)
        minibatch = tf.keras.backend.sqrt(minibatch + 1e-8)
        minibatch = tf.reduce_mean(minibatch, keepdims=True)
        # tile output to get wanted size
        minibatch = tf.keras.backend.tile(minibatch,[group_size, 1, shape[2]])
        # add the result of the computation to the input
        return tf.keras.backend.concatenate([inputs, minibatch], axis=1)

    def compute_output_shape(self, input_shape):
        # create a copy of the input shape as a list
        input_shape = list(input_shape)
        # call() concatenates along axis 1, so that is the dimension that grows by one
        input_shape[1] += 1
        # convert list to a tuple
        return tuple(input_shape)

# use sequential API to define generator model    
generator = tf.keras.models.Sequential()
# dropout layer for regularization
generator.add(tf.keras.layers.Dropout(0.5))
# bidirectional lstm which will return the entire sequence
generator.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(150, input_shape=(max_length, vocabulary_size), return_sequences=True)))
# time distributed layer to deal with the entire sequence of outputs and classifying correct word with dense softmax layer
generator.add(tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(vocabulary_size, activation='softmax')))
# we crop the output in order to have the wanted dimension
generator.add(tf.keras.layers.Cropping1D(cropping=(max_length - max_summary_length,0)))

# use sequential API to define discriminator model  
discriminator = tf.keras.models.Sequential()
# use batch normalization because it is usually helpful in GAN training
discriminator.add(tf.keras.layers.BatchNormalization())
# one dimensional convolutional layer
discriminator.add(tf.keras.layers.Conv1D(64, 5,  input_shape=[max_summary_length, vocabulary_size]))
# using leakyrelu because it gave good results for gans
discriminator.add(tf.keras.layers.LeakyReLU(alpha=0.02))
# dropout layer for regularization
discriminator.add(tf.keras.layers.Dropout(0.5))
# one dimensional convolutional layer
discriminator.add(tf.keras.layers.Conv1D(128, 5))
# using leakyrelu because it gave good results for gans
discriminator.add(tf.keras.layers.LeakyReLU(alpha=0.02))
# adding minibatch standard deviation layer to give the generator an incentive to generate different summaries
discriminator.add(MinibatchStdev())
# sigmoid layer for binary classification (real vs artificial summary)
discriminator.add(tf.keras.layers.Dense(1, activation='sigmoid'))

# combining the generator and the discriminator to create the gan
gan = tf.keras.models.Sequential([generator, discriminator])
# compile the discriminator using binary crossentropy (binary classification) and adam optimizer
discriminator.compile(loss="binary_crossentropy", optimizer="adam")
# making the discriminator untrainable and compiling the entire model (for the training of the generator)
discriminator.trainable = False
gan.compile(loss="binary_crossentropy", optimizer="adam")
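
For reference, the statistic behind a minibatch standard deviation layer can be sketched in NumPy. This follows the usual formulation, which appends the statistic as one extra feature channel; that is an assumption and differs from the layer above, which concatenates along axis 1. The function name minibatch_stddev is introduced only for this example.

```python
import numpy as np

def minibatch_stddev(x, eps=1e-8):
    """Append the batch-wide mean standard deviation as one extra feature channel."""
    # x has shape (batch, steps, features)
    std = np.sqrt(np.var(x, axis=0) + eps)       # (steps, features): spread across the batch
    mean_std = std.mean()                        # single scalar summarising batch diversity
    extra = np.full(x.shape[:2] + (1,), mean_std, dtype=x.dtype)
    return np.concatenate([x, extra], axis=-1)   # (batch, steps, features + 1)

identical = np.ones((4, 3, 2), dtype="float32")
varied = np.random.default_rng(0).normal(size=(4, 3, 2)).astype("float32")
print(minibatch_stddev(identical).shape)
print(minibatch_stddev(identical)[0, 0, -1], minibatch_stddev(varied)[0, 0, -1])
```

A batch of identical samples yields a value close to zero in the extra channel, which is the signal the discriminator can use to detect a generator that always produces the same summary.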

DEFINING THE ENCODER-DECODER MODEL

This encoder-decoder architecture embeds the input tokens (from the complete texts) and feeds them into an LSTM. The final state of this LSTM is used as the initial state of the decoder, which is another LSTM. In the decoder we also embed the input tokens (which are now the summaries) before passing them to the LSTM, and after it we place a dense layer with a softmax activation function that outputs a probability distribution over the vocabulary. We use the pretrained GloVe word embeddings in order to make training easier.

In [147]:
batch_size = 16


# size of embedding as defined in the glove model
embedding_size = 100
lstm_dim = 320

# create a matrix that contains the weights for our vocabulary as found in the glove pretrained model
embedding_matrix = np.zeros((vocabulary_size, embedding_size))
# loop through all the words in our dictionary
for word, i in tokenizer.word_index.items():
    # get the embedding vector associated to the word
    embedding_vector = embeddings_index.get(word)
    # words not found in the glove index keep a row of zeros
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        
        
# define encoder model using the functional API               
encoder_inputs = tf.keras.layers.Input(shape=(None,))
# we use an embedding layer which has the weights found in the pretrained glove model (this layer will not be trained)
embeddingEncoder = tf.keras.layers.Embedding(vocabulary_size, embedding_size, weights=[embedding_matrix], trainable=False)(encoder_inputs)
# dropout layer for regularization
embeddingEncoder = tf.keras.layers.Dropout(0.5)(embeddingEncoder)
# lstm layer where we set return_state to true because we are interested in its final state (initial state of the decoder)
encoder = tf.keras.layers.LSTM(lstm_dim, return_state=True)
# state_h is the short term final state of the lstm while state_c is the long term final state of the lstm
_, state_h, state_c = encoder(embeddingEncoder)
# saving final encoder states
encoder_states = [state_h, state_c]

# define decoder model using the functional API
decoder_inputs = tf.keras.layers.Input(shape=(None,))
# we use an embedding layer which has the weights found in the pretrained glove model (this layer will not be trained)
final_dex = tf.keras.layers.Embedding(vocabulary_size, embedding_size, weights=[embedding_matrix], trainable=False)(decoder_inputs)
# dropout layer for regularization
final_dex = tf.keras.layers.Dropout(0.5)(final_dex)
# lstm layer where we set return_state to true because we are interested in its final state, as it will be the state used to predict the next word, we also need the output in this case
decoder_lstm = tf.keras.layers.LSTM(lstm_dim, return_state=True, return_sequences=True)
# the decoder output is now the output of the model (at the current timestep)
decoder_outputs, _, _ = decoder_lstm(final_dex, initial_state=encoder_states)
# after we will need a dense layer with softmax activation to generate the probability distribution over the vocabulary
decoder_dense = tf.keras.layers.Dense(vocabulary_size, activation='softmax') 
decoder_outputs = decoder_dense(decoder_outputs)
# now we create the model that we will use with encoder and decoder inputs, and decoder outputs
model = tf.keras.models.Model([encoder_inputs, decoder_inputs], decoder_outputs)
# we compile the model using adam and categorical_crossentropy as the loss function
model.compile(optimizer = 'adam',
              loss = 'categorical_crossentropy',
              metrics=['accuracy'])
# print model summary
model.summary()

Model: "functional_281"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_158 (InputLayer)          [(None, None)]       0                                            
__________________________________________________________________________________________________
input_159 (InputLayer)          [(None, None)]       0                                            
__________________________________________________________________________________________________
embedding_95 (Embedding)        (None, None, 100)    2425500     input_158[0][0]                  
__________________________________________________________________________________________________
embedding_96 (Embedding)        (None, None, 100)    2425500     input_159[0][0]                  
_____________________________________________________________________________________
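
The way the embedding matrix lines up with the tokenizer indices can be checked on a toy vocabulary. This is a sketch with made-up 2-dimensional vectors; build_embedding_matrix is a hypothetical helper that mirrors the loop used above.

```python
import numpy as np

def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; unknown words stay zero."""
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    for word, i in word_index.items():
        vector = vectors.get(word)
        if vector is not None:
            matrix[i] = vector
    return matrix

toy_word_index = {"cat": 1, "sat": 2, "xyzzy": 3}
toy_vectors = {"cat": [0.1, 0.2], "sat": [0.3, 0.4]}   # made-up 2-d vectors
toy_matrix = build_embedding_matrix(toy_word_index, toy_vectors, dim=2)
print(toy_matrix)
```

Row 0 (padding) and the row of the unknown word remain zero, so both map to the same embedding.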

MAKING THE DATA READY FOR TRAINING

Here we simply create a dataset object by combining the complete texts and the summaries. The dataset is then shuffled and batched, after which we are ready to start training.

In [148]:
# convert the data to a dataset object and shuffle it
dataset = tf.data.Dataset.from_tensor_slices((inputTextDataset, targetSummaryDataset)).shuffle(1000)
# we batch and prefetch (start filling buffer used during training in background thread) the dataset
dataset = dataset.batch(batch_size).prefetch(1)

TRAINING FOR ENCODER DECODER

Here we train the encoder-decoder model. We build the one-hot targets batch by batch inside this loop so that we do not have to store all the one-hot outputs in one large matrix, given that the vocabulary is quite large (24255 entries).

In [150]:
def train_encoder_decoder(dataset, batch_size, n_epochs = 50):
    for epoch in range(n_epochs):
        print("The current epoch is " + str(epoch))
        # print a sample summary to monitor progress (decode_seq and inputSentence are defined in the inference cells, which must be run first)
        print(decode_seq(inputSentence))
        # use batches to make the one hot encoded outputs take less space
        for encoder_input_batch, decoder_input_batch in dataset:
            # reshape input for encoder (full sentence) and decoder (short version) to two dimensional tensors
            encoder_input_batch = tf.reshape(encoder_input_batch, [encoder_input_batch.get_shape()[0], encoder_input_batch.get_shape()[2]])
            decoder_input_batch = tf.reshape(decoder_input_batch, [decoder_input_batch.get_shape()[0], decoder_input_batch.get_shape()[2]])
            # create the decoder output by shifting the decoder input one step to the left
            # and appending a final zero (padding) to get matching sizes
            decoder_output = tf.concat([decoder_input_batch[:, 1:], tf.zeros_like(decoder_input_batch[:, :1])], axis=1)
            # we make the outputs one hot encoded vectors
            decoder_output = tf.one_hot(decoder_output, vocabulary_size)
            model.fit([encoder_input_batch, decoder_input_batch], decoder_output, verbose=0)
                 
# train the encoder-decoder model
train_encoder_decoder(dataset, batch_size)
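
The relation between decoder inputs and decoder targets (teacher forcing) is easier to see on a single sequence. The sketch below uses plain lists; the token ids and the helpers shift_targets and one_hot are hypothetical and only illustrate the shift by one step.

```python
def shift_targets(decoder_input, pad_id=0):
    """Targets are the decoder inputs moved one step left, padded at the end."""
    return decoder_input[1:] + [pad_id]

def one_hot(ids, depth):
    """Turn a list of token ids into a list of one-hot rows."""
    return [[1.0 if k == i else 0.0 for k in range(depth)] for i in ids]

# Hypothetical ids: 1 = startseq, 2 = endseq, 0 = padding.
decoder_input = [1, 7, 4, 2, 0]
targets = shift_targets(decoder_input)
print(targets)
print(one_hot(targets, depth=8)[0])
```

At every step the decoder sees the previous true word and has to predict the next one, which is why the target starts at the word following startseq.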


GAN TRAINING

Here is our attempt at training the GAN model. We start by generating summaries for the current batch using the generator, and we then use these summaries (which have label 0) and the real summaries (which have label 1) to train the discriminator. Afterwards we set the discriminator to not trainable and we train the entire GAN (only the weights of the generator can change) using label 1 for the outputs of the generator, so that the generator tries to convince the discriminator that the outputs it produces are real summaries.

In [None]:
def train_gan(dataset, batch_size, n_epochs = 5):
    generator, discriminator = gan.layers
    for epoch in range(n_epochs):
        print("The current epoch is " + str(epoch))
        for X_batch, Y_batch in dataset:
            # discriminator training
            # we reshape the X_batch to a two dimensional tensor
            X_batch = tf.reshape(X_batch, [X_batch.get_shape()[0], X_batch.get_shape()[2]])
            # we use one hot encoding to get the representation expected by the model
            X_batch = tf.one_hot(X_batch, vocabulary_size)
            # we generate fake summaries using the current generator
            generated_summaries = generator(X_batch)
            # reshaping Y batch to 2 dimensions and then using one hot encoding to get correct model input format
            Y_batch = tf.reshape(Y_batch, [Y_batch.get_shape()[0], Y_batch.get_shape()[2]])
            Y_batch = tf.one_hot(Y_batch, vocabulary_size)
            # we concatenate real and fake summaries (which will be input to the discriminator)
            X_fake_and_real = tf.concat( [generated_summaries, Y_batch], axis=0)
            # we give a label of 0 (fake) to the artificially generated summaries and a label of 1 (real) to the real summaries
            y1 = tf.constant([[0.]] * int(X_fake_and_real.get_shape()[0] / 2) + [[1.]] * int(X_fake_and_real.get_shape()[0] / 2) )
            # we now make the discriminator trainable and then we train it
            discriminator.trainable = True
            discriminator.train_on_batch(X_fake_and_real, y1)
            # here we assign a label of 1 (real) to the artificially generated summaries and then we train the generator (the discriminator is now made untrainable)
            y2 = tf.constant([[1.]] * int(X_fake_and_real.get_shape()[0] / 2))
            discriminator.trainable = False
            gan.train_on_batch(X_batch, y2) # the generator input is the one hot encoded full text (not random noise)
                 
# train the GAN model
train_gan(dataset, batch_size)

RUNNING INFERENCE WITH THE GENERATOR PART OF THE GAN MODEL

We use the generator from the GAN model that we defined to generate summaries from input text.

In [142]:
# this function recovers the original word from a token (if there is a translation)
def id_to_word(integer, tokenizer):
    # index_word is the reverse of word_index, so no loop over the vocabulary is needed
    return tokenizer.index_word.get(int(integer))
# we get the generator from the gan model
generator, _ = gan.layers
# we transform an input text to the correct format (index 1 holds the full text, index 0 the summary)
inputSentence = tf.keras.preprocessing.sequence.pad_sequences([tokenizer.texts_to_sequences([testDataset[list(testDataset.keys())[3]][1]])[0]], maxlen = max_length, padding='post')
# we run inference and then get a two dimensional shape
summary = generator(tf.one_hot(inputSentence, vocabulary_size)) # generate summary
summary = tf.reshape(summary, [summary.get_shape()[1], summary.get_shape()[2]])
finalSummary = ""
# we transform the indexes to words
for i in range(len(summary)):
    currentWord = id_to_word(argmax(summary[i]), tokenizer)
    # skip the start token and indexes without a word (e.g. the padding index 0)
    if currentWord is None or currentWord=="startseq":
        continue
    elif currentWord=="endseq":
        break
    else:
        finalSummary = finalSummary + " " + currentWord

RUNNING INFERENCE WITH THE ENCODER DECODER MODEL

We use the encoder-decoder architecture trained previously to run inference on a (padded) input sequence of words. We use a greedy approach to pick the next word.

In [152]:
#encoder-decoder model for inference
# link encoder inference to the encoder layers that we trained
encoder_model = tf.keras.models.Model(encoder_inputs, encoder_states)
# create a placeholder for the decoder input states
decoder_state_input_h = tf.keras.layers.Input(shape=(lstm_dim,))
decoder_state_input_c = tf.keras.layers.Input(shape=(lstm_dim,))
decoder_state_inputs = [decoder_state_input_h, decoder_state_input_c]
# embed the decoder inputs for inference (a new embedding layer initialised with the same glove weights that were frozen during training)
inference_decoder = tf.keras.layers.Embedding(vocabulary_size, embedding_size, weights=[embedding_matrix])(decoder_inputs)
# we link the previously trained lstm decoder weights to our inference decoder
inference_decoder_outputs, state_h2, state_c2 = decoder_lstm(inference_decoder, initial_state=decoder_state_inputs)
decoder_states2 = [state_h2, state_c2]
# we link the final decoder layer to the inference model and then we combine the decoder layers in a single model
inference_decoder_outputs = decoder_dense(inference_decoder_outputs)
decoder_model = tf.keras.models.Model([decoder_inputs] + decoder_state_inputs, [inference_decoder_outputs] + decoder_states2)


def decode_seq(input_seq):
    # get the final states from the encoder
    state_values = encoder_model.predict(input_seq)
    target_seq = np.zeros((1,1))
    # get the startseq token to initialize inference
    target_seq[0,0] = tokenizer.texts_to_sequences(['startseq'])[0][0]
    decoded_sentence = ''
    for i in range(max_summary_length):
        # run inference for the current word given the states that we are in
        output_tokens, h, c = decoder_model.predict([target_seq] + state_values)
        # get the word with the highest probability at the current step
        sampled_token_index = np.argmax(output_tokens[0,-1,:])
        sampled_char = id_to_word(sampled_token_index, tokenizer)
        # if the index has no word (e.g. the padding index 0) we break
        if sampled_char is None:
            break
        # if we reached the end of the sentence we break without adding the endseq token
        if sampled_char == 'endseq':
            break
        # we add the word to our prediction
        decoded_sentence += ' ' + sampled_char
        # update state input and word input with current state output and word output
        target_seq = np.zeros((1,1))
        target_seq[0,0] = sampled_token_index
        state_values = [h,c] 

    return decoded_sentence
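
The greedy strategy itself is independent of the model and can be sketched with a toy stand-in for the decoder. In the example below, greedy_decode and the probability table are hypothetical; the table plays the role that decoder_model.predict plays above.

```python
def greedy_decode(next_token_probs, start_id, end_id, max_len):
    """Pick the most probable token at every step until end_id or max_len."""
    tokens, current = [], start_id
    for _ in range(max_len):
        probs = next_token_probs(current)
        current = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if current == end_id:
            break
        tokens.append(current)
    return tokens

# Hypothetical stand-in for the decoder: a fixed table of next-token probabilities.
TABLE = {
    0: [0.0, 0.1, 0.7, 0.2],    # after start (0) the best token is 2
    2: [0.0, 0.1, 0.1, 0.8],    # after 2 the best token is 3
    3: [0.0, 0.9, 0.05, 0.05],  # after 3 the best token is end (1)
}
decoded = greedy_decode(lambda t: TABLE[t], start_id=0, end_id=1, max_len=10)
print(decoded)
```

Greedy decoding commits to the single best word at each step; a beam search would keep several candidates, at the cost of more decoder calls.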

EVALUATION 

We evaluate the model using the BLEU-1 score, generating the predictions with the inference function described above. We run the evaluation on the test dataset, which we set aside from the entire dataset at the beginning of the notebook.

In [132]:
def evaluate_model(dataset):
    # actual keeps track of correct sequence while predicted is the predicted sequence
    actual, predicted = list(), list()
    # looping through every key in dataset
    for key, texts in dataset.items():
        # prediction for current text
        prediction = decode_seq(tf.keras.preprocessing.sequence.pad_sequences([tokenizer.texts_to_sequences([texts[1]])[0]], maxlen = max_length, padding='post'))
        # get the correct descriptions and remove the startseq and endseq tokens
        correctSummary = texts[0].split()[1:-1]
        # append the prediction and the correct summary to the lists (corpus_bleu expects a list of references for each prediction)
        actual.append([correctSummary])
        predicted.append(prediction.split())
    # calculate BLEU score for 1 GRAM
    print("Bleu-1 score: " + str(corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0))))
    
evaluate_model(testDataset)    
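
To make the metric less opaque, here is a minimal sentence-level sketch of BLEU-1: clipped unigram precision multiplied by the brevity penalty. The function bleu1 is hypothetical and handles a single reference; corpus_bleu instead accumulates the counts over the whole corpus, so its numbers will generally differ.

```python
import math
from collections import Counter

def bleu1(reference, candidate):
    """Clipped unigram precision times the brevity penalty, for one sentence pair."""
    if not candidate:
        return 0.0
    ref_counts, cand_counts = Counter(reference), Counter(candidate)
    # each candidate word is credited at most as often as it appears in the reference
    overlap = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    precision = overlap / len(candidate)
    if len(candidate) >= len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(reference) / len(candidate))
    return penalty * precision

reference = "storm hits the coast".split()
print(bleu1(reference, "storm hits coast".split()))
print(bleu1(reference, "storm storm storm storm".split()))
```

The first candidate has perfect precision but is penalised for being shorter than the reference; the second shows how clipping prevents a repeated word from inflating the score.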