In [10]:
import csv
import tensorflow as tf
import string
import random
from numpy import argmax
import numpy as np

READING DATA FROM CSV FILE

We use a dataset consisting of approximatly 4000 articles with their summarization, here we read this data from a csv file and we do not consider articles that are too short (less than 5 words), articles that are long (more than 300 words, we do not consider this articles to ease the training process) and articles that are duplicates. We save this data in a dictionary and we then print how large the dataset that we will use is.

In [2]:
# reading the dataset from a csv file
with open('news_summary.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    # create a dictionary where we will store pairs (complete text/ summary)
    dataset = dict()
    # loop through every line in the csv file
    for row in csv_reader:
            # if the current line is not the first one (descriptions of fields) and it is not a duplicate, we update the dictionary
            # we also remove texts that are too long to ease the training process
            if row[3] != 'read_more' and row[3] not in dataset and len(row[5].split()) < 300:
                # push pairs of input text / summaries to dictionary (the key is the url of the text), we need to add a startseq and endseq token
                # we are removing really short input
                if len(row[5].split()) < 5 or len(row[4].split()) < 3:
                    continue
                dataset[row[3]] = list()
                dataset[row[3]].append('startseq ' + row[4] + ' endseq')
                dataset[row[3]].append('startseq ' + row[5] + ' endseq')
print(len(dataset))        

2366


CLEANING THE TEXTS

We clean the initial articles and the related summaries by removing punctuation and making all the characters lowercase.

In [3]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
def clean_descriptions(descriptions):
    # loop through all keys in dictionary
    for key, text_list in descriptions.items():
        # loop through the full text and summary associated to a given key
        for i in range(len(text_list)):
            # transform current sentence to array of words
            text = text_list[i].split()
            # convert all words to lower case
            text = [word.lower() for word in text]
            # remove punctuation from each word
            text = [w.translate(str.maketrans('', '', string.punctuation)) for w in text]
            text = [w for w in text if not w in set(stopwords.words('english'))]
            # store cleaned descriptions as string
            text_list[i] =  ' '.join(text)
            
# remove punctuation from text and make all text lowercase 
clean_descriptions(dataset)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\brace\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


CREATE TOKENIZER AND FIND SIZES USEFUL FOR DEFINING THE MODEL

Here we fit a tokenizer to our dataset, afterwards we use this tokenizer object to find the size of our vocabulary (which is equal to the number of unique words in our dataset, we will need this value to know the size of the softmax output layer for our model). We also find the maximum length of input text and summary, this values are also used for defining the model.

In [5]:
# get a list from dictionary of data to be fed into tokenizer
listFromDictTot = list()
# append all full texts and summaries to the list
for key in dataset.keys():
        [listFromDictTot.append(d) for d in dataset[key]]
# tokenize all words (the machine learning model is going to need numbers as input)
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(listFromDictTot)
# get the size of the vocabulary created with our dataset (unique words in dataset)
vocabulary_size = len(tokenizer.word_index) + 1
print("The complete vocabulary size is " + str(vocabulary_size))
# get the maximum length of a full text and of a summary in the dataset (size needed when creating model to train)
max_summary_length = max(len(listFromDictTot[i].split()) for i in range(0,len(listFromDictTot),2))
max_length = max(len(listFromDictTot[i].split()) for i in range(1,len(listFromDictTot),2))
print('Maximum full text Length: %d' % max_length)
print('Maximum summary Length: %d' % max_summary_length)

The complete vocabulary size is 38568
Maximum full text Length: 210
Maximum summary Length: 49


TRAIN/TEST SPLIT

Here we split the dataset in train and test data (we create 2 dictionaries containing this data), in the end of the block we also check the size of the training and test dataset to make sure that everything worked correctly.

In [6]:
# get the keys
keys = list(dataset.keys())
# reorder the keys randomly
random.shuffle(keys)
# set the percentage of data that will be used for training
trainPercentage = 0.8
# get the last index for which the data will be assigned to the training test
trainLastIndex = int(trainPercentage * len(keys))
trainDataset = dict()
# fill training dictionary
for i in range(trainLastIndex):
    trainDataset[keys[i]] = dataset[keys[i]]
testDataset = dict()
# fill test dictionary
for i in range(trainLastIndex, len(keys)):
    testDataset[keys[i]] = dataset[keys[i]]
# print size of the train and test data to make sure that everything worked well
print("The length of the training dataset is " + str(len(trainDataset)))
print("The length of the test dataset is " + str(len(testDataset))) 

The length of the training dataset is 1892
The length of the test dataset is 474


TRANSFORM THE DATA IN A BETTER FORMAT

Here we transform the dictionary containing the training data into two lists, one contains the summaries and the other the complete text. The text is transformed to a sequence of tokens and is padded to the maximum lengths before being inserted in the relative lists.

In [7]:
inputTextDataset = list()
targetSummaryDataset = list()
# append all full texts and summaries to the list
for key in trainDataset.keys():
        inputTextDataset.append(tf.keras.preprocessing.sequence.pad_sequences([tokenizer.texts_to_sequences([trainDataset[key][1]])[0]], maxlen = max_length, padding='post'))
        targetSummaryDataset.append(tf.keras.preprocessing.sequence.pad_sequences([tokenizer.texts_to_sequences([trainDataset[key][0]])[0]], maxlen = max_summary_length, padding='post'))

DEFINE GAN MODEL

Here we define a GAN model where the generator is biderectional LSTM that takes as input a full text and produces a summary, this is then fed with real summaries to generator (which uses a CNN, that is usually considered a good model for text classification) that is supposed to distinguish between artificial and real summaries. Unfortunately we did not succeed in stabilize the training, therefore we did not obtain good results (mostly because of the short time and computing resources available), we tried some techniques including adding a minibatch standard deviation layer in the end of the discriminator, using batch normalization and substituting ReLU with LeakyReLU, we believe that the discriminator is training much faster than generator, forcing the generator to stop learning because of the lack of incentives. Given the problems in stabilizing the adversarial training we opted for an encoder decoder architecture which is defined in the next box.

In [8]:
latent_dim = 200
embedding_dim=110
batch_size = 16

# mini-batch standard deviation layer
class MinibatchStdev(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super(MinibatchStdev, self).__init__(**kwargs)
    # perform the operation
    def call(self, inputs):
        group_size = tf.shape(inputs)[0]
        shape = list(tf.keras.backend.int_shape(inputs))
        shape[0] = tf.shape(inputs)[0]
        minibatch = tf.keras.backend.reshape(inputs,(group_size, -1, shape[1], shape[2]))
        minibatch -= tf.reduce_mean(minibatch, axis=0, keepdims=True)
        minibatch = tf.reduce_mean(tf.keras.backend.square(minibatch), axis = 0)
        minibatch = tf.keras.backend.square(minibatch + 1e8)
        minibatch = tf.reduce_mean(minibatch, keepdims=True)
        minibatch = tf.keras.backend.tile(minibatch,[group_size, 1, shape[2]])
        return tf.keras.backend.concatenate([inputs, minibatch], axis=1)

    def compute_output_shape(self, input_shape):
        # create a copy of the input shape as a list
        input_shape = list(input_shape)
        # add one to the channel dimension (assume channels-last)
        input_shape[-1] += 1
        # convert list to a tuple
        return tuple(input_shape)

    
generator = tf.keras.models.Sequential()
generator.add(tf.keras.layers.Dropout(0.5))
# generator.add(tf.keras.layers.LSTM(150, input_shape=(max_length, embedding_dim),  batch_input_shape=[batch_size], stateful=True))
# generator.add(tf.keras.layers.RepeatVector(max_length))
generator.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(150, input_shape=(max_length, vocabulary_size), return_sequences=True, stateful=False)))
generator.add(tf.keras.layers.Dropout(0.5))
generator.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(150, input_shape=(max_length, vocabulary_size), return_sequences=True, stateful=False)))
generator.add(tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(vocabulary_size, activation='softmax')))
generator.add(tf.keras.layers.Cropping1D(cropping=(max_length - max_summary_length,0)))


discriminator = tf.keras.models.Sequential()
discriminator.add(tf.keras.layers.BatchNormalization())
discriminator.add(tf.keras.layers.Conv1D(64, 5,  input_shape=[max_summary_length, vocabulary_size], batch_size=batch_size))
discriminator.add(tf.keras.layers.LeakyReLU(alpha=0.02))
discriminator.add(tf.keras.layers.Dropout(0.5))
discriminator.add(tf.keras.layers.Conv1D(128, 5))
discriminator.add(tf.keras.layers.LeakyReLU(alpha=0.02))
discriminator.add(MinibatchStdev())
discriminator.add(tf.keras.layers.Dense(1, activation='sigmoid'))

gan = tf.keras.models.Sequential([generator, discriminator])
# compile the model
discriminator.compile(loss="binary_crossentropy", optimizer="adam")
discriminator.trainable = False
gan.compile(loss="binary_crossentropy", optimizer="adam")
#generator.summary()

DEFINING THE ENCODER-DECODER MODEL

This encoder decoder architecture embedds the input tokens (from the complete texts) and then feed this data into an LSTM, we then use the final state of this LSTM as the initial state for the decoder model which is another LSTM, also here we embedd the input tokens (which now are the summaries) before passing them to an LSTM, after this LSTM we place a dense layer with softmax activation function that outputs a probability distribution over the vocabulary. We use the glove pretrained word embeddings in order to make training easier.

In [11]:
embeddings_index = {}
f = open('glove.6B.100d.txt', encoding="utf8")
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.


In [18]:
batch_size = 16


embedding_size = 100
lstm_dim = 320
embedding_matrix = np.zeros((vocabulary_size, embedding_size))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
        
encoder_inputs = tf.keras.layers.Input(shape=(max_length))
en_x = tf.keras.layers.Embedding(vocabulary_size, embedding_size, weights=[embedding_matrix], input_length=max_length, trainable=False)(encoder_inputs)
en_x = tf.keras.layers.Dropout(0.5)(en_x)
encoder = tf.keras.layers.LSTM(lstm_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(en_x)
encoder_states = [state_h, state_c]
decoder_inputs = tf.keras.layers.Input(shape=(max_summary_length))
final_dex = tf.keras.layers.Embedding(vocabulary_size, embedding_size, weights=[embedding_matrix], input_length=max_summary_length, trainable=False)(decoder_inputs)
final_dex = tf.keras.layers.Dropout(0.5)(final_dex)
decoder_lstm = tf.keras.layers.LSTM(lstm_dim, return_sequences=True, return_state=True)

decoder_outputs, _, _ = decoder_lstm(final_dex, initial_state=encoder_states)

decoder_dense = tf.keras.layers.Dense(vocabulary_size, activation='softmax') 

decoder_outputs = decoder_dense(decoder_outputs)
model = tf.keras.models.Model([encoder_inputs, decoder_inputs], decoder_outputs)

model.compile(optimizer = 'rmsprop',
              loss = 'categorical_crossentropy',
              metrics=['accuracy'])
model.summary()

Model: "functional_13"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_9 (InputLayer)            [(None, 210)]        0                                            
__________________________________________________________________________________________________
input_10 (InputLayer)           [(None, 49)]         0                                            
__________________________________________________________________________________________________
embedding_6 (Embedding)         (None, 210, 100)     3856800     input_9[0][0]                    
__________________________________________________________________________________________________
embedding_7 (Embedding)         (None, 49, 100)      3856800     input_10[0][0]                   
______________________________________________________________________________________

MAKING THE DATA READY FOR TRAINING

Here we simply create a dataset object by combining complete text and summaries, this dataset is then shuffled and batched, after these steps we are ready to start training

In [19]:
fullTextDataset = tf.data.Dataset.from_tensor_slices(inputTextDataset)
summaryDataset = tf.data.Dataset.from_tensor_slices(targetSummaryDataset) # .shuffle(1000)
dataset = tf.data.Dataset.from_tensor_slices((inputTextDataset, targetSummaryDataset)).shuffle(1000)
dataset = dataset.batch(batch_size).prefetch(1)

TRAINING FOR ENCODER DECODER

Here we train he encoder decoder model, we use this loop in order to not have to store all the one hot outputs in one large matriz considering that the vocabulary size is quiet large (almost 40000 entries).

In [None]:
#Inference Stage
import numpy as np
#encoder model
# this function recovers the original word from a token (if there is a translation)
def id_to_word(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None
#encoder-decoder model
encoder_model = tf.keras.models.Model(encoder_inputs, encoder_states)
decoder_state_input_h = tf.keras.layers.Input(shape=(lstm_dim,))
decoder_state_input_c = tf.keras.layers.Input(shape=(lstm_dim,))
decoder_state_inputs = [decoder_state_input_h, decoder_state_input_c]

final_dex2 = tf.keras.layers.Embedding(vocabulary_size, embedding_size)(decoder_inputs)

decoder_outputs2, state_h2, state_c2 = decoder_lstm(final_dex2, initial_state=decoder_state_inputs)
decoder_states2 = [state_h2, state_c2]
decoder_outputs2 = decoder_dense(decoder_outputs2)

decoder_model = tf.keras.models.Model([decoder_inputs] + decoder_state_inputs, [decoder_outputs2] + decoder_states2)

def decode_seq(input_seq):
    state_values = encoder_model.predict(input_seq)

    target_seq = np.zeros((1,1))
    target_seq[0,0] = tokenizer.texts_to_sequences(['startseq'])[0][0]

    stop_condition = False
    decoded_sentence = ''
    
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + state_values)

        sampled_token_index = np.argmax(output_tokens[0,-1,:])
        sampled_char = id_to_word(sampled_token_index, tokenizer)
        if sampled_char == None:
            break
        decoded_sentence += ' ' + sampled_char

        if(sampled_char == 'endseq' or len(decoded_sentence) > max_summary_length):
            stop_condition = True

        target_seq = np.zeros((1,1))
        target_seq[0,0] = sampled_token_index

        state_values = [h,c] 

    return decoded_sentence
inputSentence = tf.keras.preprocessing.sequence.pad_sequences([tokenizer.texts_to_sequences([trainDataset[list(trainDataset.keys())[3]][1]])[0]], maxlen = max_length, padding='post')
def train_encoder_decoder(dataset, batch_size, n_epochs = 50):
    epochCounter = 0
    for epoch in range(n_epochs):
        print("The current epoch is " + str(epochCounter))
        print(decode_seq(inputSentence))
        epochCounter += 1
        batchCounter = 0
        for encoder_input_batch, decoder_input_batch in dataset:
            batchCounter += 1
            encoder_input_batch = tf.reshape(encoder_input_batch, [encoder_input_batch.get_shape()[0], encoder_input_batch.get_shape()[2]])
            decoder_input_batch = tf.reshape(decoder_input_batch, [decoder_input_batch.get_shape()[0], decoder_input_batch.get_shape()[2]])
            decoder_output = list()
            for i in range(encoder_input_batch.get_shape()[0]):
                for j in range(max_summary_length):
                    if j == 0:
                        decoder_output.append(list())  
                        continue
                    decoder_output[i].append(decoder_input_batch[i][j])
                decoder_output[i].append(0)
            decoder_output = tf.one_hot(decoder_output, vocabulary_size)
            model.fit([encoder_input_batch, decoder_input_batch], decoder_output, verbose=1)
                 
# train the GAN model
        
train_encoder_decoder(dataset, batch_size)


The current epoch is 0
 salamabad sadhna outbid misunderstandings damaged
The current epoch is 1
 actor actor actor actor actor said said said said


In [92]:
model.save("encoder_decoder.h5")
trainedModel = model

GAN TRAINING

Here is our attempt at training the GAN model, we start by generating summaries for the current batch using the generator, we then use this summaries (which will have label 0) and the real summaries (which will have label 1) to train the discriminator. Afterwards we set the discriminator to not trainable and we train the entire GAN (only the weights of the generator can change) using labels 1 for the outputs of the genarator, therefore trying to make the generator convince the discriminator that the output that it is producing are real summaries.

In [None]:
def train_gan(dataset, batch_size, n_epochs = 5):
    generator, discriminator = gan.layers
    epochCounter = 0
    for epoch in range(n_epochs):
        print("The current epoch is " + str(epochCounter))
        epochCounter += 1
        currentBatch = 0
        for X_batch, Y_batch in dataset:
            # phase 1 training the discriminator
            # noise = tf.random.normal(shape=[batch_size, codings_size])
            #print("The current batch is " + str(currentBatch))
            currentBatch += 1
            X_batch = tf.reshape(X_batch, [X_batch.get_shape()[0], X_batch.get_shape()[2]])
            X_batch = tf.one_hot(X_batch, vocabulary_size)
            generated_reviews = generator(X_batch) # substitute noise with current x_input
            summary = tf.reshape(generated_reviews[0], [generated_reviews[0].get_shape()[0], generated_reviews[0].get_shape()[1]])
            Y_batch = tf.reshape(Y_batch, [Y_batch.get_shape()[0], Y_batch.get_shape()[2]])
            Y_batch = tf.one_hot(Y_batch, vocabulary_size)
            X_fake_and_real = tf.concat( [generated_reviews, Y_batch], axis=0)
            y1 = tf.constant([[0.]] * int(X_fake_and_real.get_shape()[0] / 2) + [[1.]] * int(X_fake_and_real.get_shape()[0] / 2) )
            discriminator.trainable = True
            discriminator.train_on_batch(X_fake_and_real, y1)
            # phase 2 training the generator
            y2 = tf.constant([[1.]] * int(X_fake_and_real.get_shape()[0] / 2))
            discriminator.trainable = False
            gan.train_on_batch(X_batch, y2) # substitute noise with the corrent x_input
                 
# train the GAN model
train_gan(dataset, batch_size)

In [98]:
gan.save('finalGan.h5')

In [99]:
generator, discriminator = gan.layers
generator.save('finalGenerator.h5')

In [128]:
testKeys = list()
for key in testDataset.keys():
    if len(testDataset[key][0]) < 300 and len(testDataset[key][1]) > 20:
        testKeys.append(key)
        print(testDataset[key][1])

startseq bollywood actor taapsee pannu will soon be seen in a completely different avatar in her upcoming film judwaa 2 even as the pink star shoots for the film opposite varun dhawan she has signed anubhav sinhas next which is a social thriller titled mulk the film will also feature rishi kapoor and taapsee plays the role of his daughterinlaw in the movie the shooting of mulk will begin in a few months and will be shot in lucknow and varanasifrom a superhero to vulnerable human beings in a small town in my country mulk a space i grew up in but never visited as a film maker anubhav sinha anubhavsinha july 28 2017mulk is a story inspired by true events about a joint family that is involved in a controversy and taapsees character is helping the family reclaim their honourfollow htshowbiz for more endseq
startseq taimur ali khan is undoubtedly one of the cutest star kids on the block the fivemonthold toddler enjoys as much fan following as his mother kareena kapoor khan ever since his bir

In [129]:
# this function recovers the original word from a token (if there is a translation)
def word_to_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None
generator = tf.keras.models.load_model('finalGenerator.h5')
inputSentence = tf.keras.preprocessing.sequence.pad_sequences([tokenizer.texts_to_sequences([testDataset[list(testDataset.keys())[3]][0]])[0]], maxlen = max_length, padding='post')
summary = generator(inputSentence) # generate summary
summary = tf.reshape(summary, [summary.get_shape()[1], summary.get_shape()[2]])
finalSummary = ""
for i in range(len(summary)):
    currentWord = word_to_id(argmax(summary[i]), tokenizer)
    if currentWord=="startseq":
        print("hello")
        continue
    elif currentWord=="endseq":
        print("hello")
        break
    else:
        finalSummary = finalSummary + " " + currentWord
print("The generated summary is: " + finalSummary)    

The generated summary is:  to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to


RUNNING INFERENCE

We use the encoder decoder architecture trained previously to run inference on an input sequence of words (which is padded), we use a greedy approach to picking the next word.

In [89]:
#Inference Stage
import numpy as np
#encoder model
# this function recovers the original word from a token (if there is a translation)
def id_to_word(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None
#encoder-decoder model
encoder_model = tf.keras.models.Model(encoder_inputs, encoder_states)
decoder_state_input_h = tf.keras.layers.Input(shape=(lstm_dim,))
decoder_state_input_c = tf.keras.layers.Input(shape=(lstm_dim,))
decoder_state_inputs = [decoder_state_input_h, decoder_state_input_c]

final_dex2 = tf.keras.layers.Embedding(max_summary_length, embedding_size)(decoder_inputs)

decoder_outputs2, state_h2, state_c2 = decoder_lstm(final_dex2, initial_state=decoder_state_inputs)
decoder_states2 = [state_h2, state_c2]
decoder_outputs2 = decoder_dense(decoder_outputs2)

decoder_model = tf.keras.models.Model([decoder_inputs] + decoder_state_inputs, [decoder_outputs2] + decoder_states2)

def decode_seq(input_seq):
    state_values = encoder_model.predict(input_seq)

    target_seq = np.zeros((1,1))
    target_seq[0,0] = tokenizer.texts_to_sequences(['startseq'])[0][0]

    stop_condition = False
    decoded_sentence = ''
    
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + state_values)

        sampled_token_index = np.argmax(output_tokens[0,-1,:])
        if sampled_token_index != 0:
            print(id_to_word(sampled_token_index, tokenizer))
        sampled_char = id_to_word(sampled_token_index, tokenizer)
        if sampled_char == None:
            continue
        decoded_sentence += ' ' + sampled_char

        if(sampled_char == 'endseq' or len(decoded_sentence) > max_summary_length):
            stop_condition = True

        target_seq = np.zeros((1,1))
        target_seq[0,0] = sampled_token_index

        state_values = [h,c] 

    return decoded_sentence
inputSentence = tf.keras.preprocessing.sequence.pad_sequences([tokenizer.texts_to_sequences([testDataset[list(testDataset.keys())[3]][0]])[0]], maxlen = max_length, padding='post')
print(decode_seq(inputSentence))

startseq


KeyboardInterrupt: 

EVALUATION 

We evaluate the dataset using BLEU-1 score and we do this by using the inference function described above. We ran evaluation using the test dataset which we set aside from the entire dataset in the beginning of the notebook.

In [167]:
def evaluate_model(dataset):
    # actual keeps track of correct sequence while predicted is the predicted sequence
    actual, predicted = list(), list()
    # looping through every image in dataset
    for key, texts in dataset.items():
        # prediction for current image
        prediction = decode_seq(tf.keras.preprocessing.sequence.pad_sequences([tokenizer.texts_to_sequences([texts[1]])[0]], maxlen = max_length, padding='post'))
        # get the correct descriptions and remove the startseq and endseq tokens
        correctSummary = texts[0].split()[1:-1]
        # append the prediction and the correct description to the lists
        actual.append(correctSummary)
        predicted.append(prediction.split())
    # calculate BLEU score for 1 ,2 ,3 and 4 GRAM
    print("Bleu-1 score: " + str(corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0))))
    
evaluate_model(testDataset)    

KeyboardInterrupt: 