# Text Summarization of news articles
In this notebook I will generate summaries with the help of my Seq2Seq model in Summarizer.py.


In [3]:
import os

import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from collections import Counter

import Summarizer
import summarizer_data_utils
import summarizer_model_utils


In [2]:
print(tf.__version__)

1.8.0


## The data
The data we will use here is the 'all-the-news'-dataset from Kaggle. It contains about 200000 news articles and the headlines of those articles. The headlines will serve as our summaries in this case.
The articles are from several big news corporations.

https://www.kaggle.com/snapcrack/all-the-news



### Reading and exploring

In [4]:
# the dataset consists of 3 .csv files. we will concatenate them.
data = pd.read_csv('./articles1.csv',
                   encoding='utf-8')
data1 = pd.read_csv('./articles2.csv',
                    encoding='utf-8')
data2 = pd.read_csv('./articles3.csv',
                    encoding='utf-8')


In [6]:
data = pd.concat([data, data1, data2])
data.shape

(235140, 10)

In [7]:
# we are only going to use title and content.
data.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood..."
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri..."
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t..."
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ..."


In [8]:
Counter(data.publication)

Counter({'Atlantic': 14187,
         'Breitbart': 23781,
         'Business Insider': 6757,
         'Buzzfeed News': 9708,
         'CNN': 11488,
         'Fox News': 8708,
         'Guardian': 17362,
         'NPR': 23984,
         'National Review': 12406,
         'New York Post': 34986,
         'New York Times': 7803,
         'Reuters': 21420,
         'Talking Points Memo': 10428,
         'Vox': 9894,
         'Washington Post': 22228})

In [9]:
data = data[data.publication != 'Breitbart']

In [10]:
# a few titles are missing (4, see the output below).
data.isnull().sum()

Unnamed: 0         0
id                 0
title              4
publication        0
author         25446
date            5282
year            5282
month           5282
url            40241
content            0
dtype: int64

In [11]:
# drop those. 
data.dropna(subset=['title'], inplace = True)

In [12]:
# to make the transition from the amazon review example to this one as comfortable as possible,
# we just rename the columns.
data.rename(index = str, columns = {'title':'Summary', 'content':'Text'}, inplace = True)
data = data[['Summary', 'Text']]

In [13]:
data = data[['Summary', 'Text']]
data.head()

Unnamed: 0,Summary,Text
0,House Republicans Fret About Winning Their Hea...,WASHINGTON — Congressional Republicans have...
1,Rift Between Officers and Residents as Killing...,"After the bullet shells get counted, the blood..."
2,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...","When Walt Disney’s “Bambi” opened in 1942, cri..."
3,"Among Deaths in 2016, a Heavy Toll in Pop Musi...","Death may be the great equalizer, but it isn’t..."
4,Kim Jong-un Says North Korea Is Preparing to T...,"SEOUL, South Korea — North Korea’s leader, ..."


In [14]:
# let's have a look. 
for x in data.Summary[:10]:
    print(x)

House Republicans Fret About Winning Their Health Care Suit - The New York Times
Rift Between Officers and Residents as Killings Persist in South Bronx - The New York Times
Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial Bias, Dies at 106 - The New York Times
Among Deaths in 2016, a Heavy Toll in Pop Music - The New York Times
Kim Jong-un Says North Korea Is Preparing to Test Long-Range Missile - The New York Times
Sick With a Cold, Queen Elizabeth Misses New Year’s Service - The New York Times
Taiwan’s President Accuses China of Renewed Intimidation - The New York Times
After ‘The Biggest Loser,’ Their Bodies Fought to Regain Weight - The New York Times
First, a Mixtape. Then a Romance. - The New York Times
Calling on Angels While Enduring the Trials of Job - The New York Times


In [15]:
data.Text[0]

'WASHINGTON  —   Congressional Republicans have a new fear when it comes to their    health care lawsuit against the Obama administration: They might win. The incoming Trump administration could choose to no longer defend the executive branch against the suit, which challenges the administration’s authority to spend billions of dollars on health insurance subsidies for   and   Americans, handing House Republicans a big victory on    issues. But a sudden loss of the disputed subsidies could conceivably cause the health care program to implode, leaving millions of people without access to health insurance before Republicans have prepared a replacement. That could lead to chaos in the insurance market and spur a political backlash just as Republicans gain full control of the government. To stave off that outcome, Republicans could find themselves in the awkward position of appropriating huge sums to temporarily prop up the Obama health care law, angering conservative voters who have been 

In [16]:
# again we will not use all of the examples, but only pick some. 
len_summaries = [len(summary) for summary in data.Summary]
len_texts = [len(text) for text in data.Text]

In [17]:
len_summaries_counted = Counter(len_summaries).most_common()
len_texts_counted = Counter(len_texts).most_common()
len_summaries_counted[:10], len_texts_counted[:10]

([(63, 5473),
  (64, 5440),
  (60, 5434),
  (59, 5372),
  (62, 5372),
  (58, 5315),
  (61, 5257),
  (65, 5220),
  (66, 4992),
  (67, 4982)],
 [(2878, 67),
  (2715, 66),
  (3976, 64),
  (2457, 61),
  (2935, 61),
  (2889, 61),
  (2858, 61),
  (4191, 60),
  (3612, 60),
  (2617, 59)])

In [18]:
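# we will only use shorter texts (51 to 199 characters), as I have limited resources
# and those are easier to learn.
indices = [ind for ind, text in enumerate(data.Text) if 50 < len(text) < 200]

In [None]:
# assumption: the cell defining these two lists is missing, so we select
# the rows at `indices` by position.
texts_unprocessed = data.Text.iloc[indices].tolist()
summaries_unprocessed = data.Summary.iloc[indices].tolist()
len(indices), len(texts_unprocessed), len(summaries_unprocessed)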

In [20]:
# titles from nyt and breitbart end with the publication name, which is not
# relevant for the summary, so we replace that ending with a period.
# (breitbart was already filtered out above; its suffix is kept only as a safeguard.)
to_remove = ['- The New York Times', '- Breitbart']

summaries_unprocessed_clean = []

removed = 0
for sentence in summaries_unprocessed:
    for r in to_remove:
        if sentence.endswith(r):
            # cut off only the trailing suffix and close the title with a period.
            sentence = sentence[:-len(r)] + '.'
            removed += 1
            break
    summaries_unprocessed_clean.append(sentence)



In [249]:
len(summaries_unprocessed_clean), len(texts_unprocessed)

(725, 725)

### Clean and prepare the data

In [21]:
# preprocess the texts and summaries.
# we have the option to keep_most or not. in this case we do not want 'to keep most', i.e. we will only keep
# letters and numbers. 
# (to improve the model, this preprocessing step should be refined)
processed_texts, processed_summaries, words_counted = summarizer_data_utils.preprocess_texts_and_summaries(
    texts_unprocessed,
    summaries_unprocessed_clean,
    keep_most=False)

Processing Time:  0.5395939350128174


In [24]:
# some of the texts are empty; remove those.
processed_texts_clean = []
processed_summaries_clean = []

for t, s in zip(processed_texts, processed_summaries):
    if t != [] and s != []:
        processed_texts_clean.append(t)
        processed_summaries_clean.append(s)

### Create lookup dicts

We cannot feed our network actual words, only numbers. So we first have to create our lookup dicts, where each word gets an int value (high or low, depending on its frequency in our corpus). These later help us convert the texts into numbers.

We also add special tokens. EndOfSentence and StartOfSentence are crucial for the Seq2Seq model we use later.
The pad token is needed because all summaries and texts in a batch must have the same length; shorter sequences are filled up with it. The unknown token stands in for words that are not in the vocabulary.

So we need 2 lookup dicts:
 - from word to index
 - from index to word
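
To make the idea concrete, here is a minimal sketch of how such lookup dicts could be built. It is not the code from summarizer_data_utils: the function name `build_lookup_dicts` is hypothetical, and it assumes that `words_counted` is a list of `(word, count)` pairs sorted by frequency.

```python
from collections import Counter


def build_lookup_dicts(words_counted, specials, min_occurrences=2):
    """Hypothetical stand-in for create_word_inds_dicts (assumed behavior)."""
    word2ind, ind2word, missing_words = {}, {}, []

    # special tokens get the first indices.
    for token in specials:
        ind = len(word2ind)
        word2ind[token] = ind
        ind2word[ind] = token

    # frequent words come first and therefore get low indices;
    # words below the threshold are collected as missing.
    for word, count in words_counted:
        if count >= min_occurrences:
            ind = len(word2ind)
            word2ind[word] = ind
            ind2word[ind] = word
        else:
            missing_words.append(word)

    return word2ind, ind2word, missing_words


tokens = "the court ruling on the texas law the court".split()
words_counted = Counter(tokens).most_common()
word2ind, ind2word, missing_words = build_lookup_dicts(
    words_counted, specials=["<EOS>", "<SOS>", "<PAD>", "<UNK>"])
print(word2ind)
print(missing_words)
```

Words that fall below the threshold are not lost entirely: at conversion time they are mapped to the `<UNK>` index.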

In [307]:
# create lookup dicts.
# most of the words appear only once.
# min_occurences set to 2 reduces our vocabulary by more than half.
specials = ["<EOS>", "<SOS>","<PAD>","<UNK>"]
word2ind, ind2word,  missing_words = summarizer_data_utils.create_word_inds_dicts(words_counted,
                                                                                  specials = specials,
                                                                                  min_occurences = 2)
print(len(word2ind), len(ind2word), len(missing_words))


2033 2033 2489


### Pretrained embeddings

Optionally we can use pretrained word embeddings. These often speed up training and improve accuracy.
Here I tried two different options: GloVe embeddings and embeddings from tf_hub.
The ones from tf_hub worked better.

In [0]:
# glove_embeddings_path = '/Users/thomas/Jupyter_Notebooks/Pro Deep Learning with Tensorflow/Notebooks/glove/glove.6B.300d.txt'
# embedding_matrix_save_path = './embeddings/my_embedding.npy'
# emb = summarizer_data_utils.create_and_save_embedding_matrix(word2ind,
#                                                        glove_embeddings_path,
#                                                        embedding_matrix_save_path)
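
The GloVe route boils down to building one matrix row per vocabulary word. The sketch below shows that idea only; it is not the implementation of `create_and_save_embedding_matrix`. The function `build_embedding_matrix`, the toy `pretrained` dict and the random initialization range are all assumptions made for illustration.

```python
import numpy as np


def build_embedding_matrix(word2ind, pretrained, dim, seed=0):
    """Hypothetical helper: one row per word, indexed by word2ind."""
    rng = np.random.default_rng(seed)
    # start from small random vectors, so words without a pretrained
    # vector (e.g. the special tokens) still get a row.
    matrix = rng.uniform(-0.1, 0.1, size=(len(word2ind), dim)).astype(np.float32)

    found = 0
    for word, ind in word2ind.items():
        vec = pretrained.get(word)
        if vec is not None:
            matrix[ind] = vec
            found += 1
    return matrix, found


word2ind = {"<PAD>": 0, "<UNK>": 1, "court": 2, "ruling": 3}
# toy vectors standing in for vectors parsed from a GloVe text file.
pretrained = {"court": np.ones(4), "ruling": np.zeros(4)}
matrix, found = build_embedding_matrix(word2ind, pretrained, dim=4)
print(matrix.shape, found)
```

The important property is that row `i` of the matrix belongs to the word with index `i`, so the matrix has to be rebuilt whenever the vocabulary changes.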

In [310]:
# the embeddings from tf.hub. 
# embed = hub.Module("https://tfhub.dev/google/nnlm-en-dim128/1")
embed = hub.Module("https://tfhub.dev/google/Wiki-words-250/1")
emb = embed([key for key in word2ind.keys()])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    embedding = sess.run(emb)

INFO:tensorflow:Initialize variable module_2/embeddings/part_0:0,module_2/embeddings/part_1:0 from checkpoint b'/tmp/tfhub_modules/1e7e72950396d90315d3d9c57eddf5da44d4dca2/variables/variables' with embeddings


In [311]:
embedding.shape

(2033, 250)

In [0]:
np.save('./tf_hub_embedding_headlines.npy', embedding)

### Convert text and summaries
As mentioned before, we cannot feed the words directly to our network; we have to convert them to numbers first. That is what we do here. We also add the SOS and EOS tokens to the summaries.

In [0]:
# converts words in texts and summaries to indices
converted_texts, unknown_words_in_texts = summarizer_data_utils.convert_to_inds(processed_texts_clean,
                                                                                word2ind,
                                                                                eos = False)

In [0]:
converted_summaries, unknown_words_in_summaries = summarizer_data_utils.convert_to_inds(processed_summaries_clean,
                                                                                        word2ind,
                                                                                        eos = True,
                                                                                        sos = True)

In [316]:
# seems to have worked well. 
print(summarizer_data_utils.convert_inds_to_text(converted_texts[0], ind2word))
print(summarizer_data_utils.convert_inds_to_text(converted_summaries[0], ind2word))


['in', 'a', 'major', 'abortion', 'ruling', 'monday', 'the', 'supreme', 'court', '<UNK>', 'down', '<UNK>', 'of', 'a', 'texas', 'law', 'that', 'would', 'have', '<UNK>', 'dozens', 'of', '<UNK>', 'to', 'close', 'here', 'are', 'reactions', 'from', 'all', '<UNK>', 'of', 'the', '<UNK>']
['<SOS>', 'reactions', 'to', 'the', 'supreme', 'court', 'ruling', 'on', 'texas', 'abortion', 'law', '<EOS>']
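
The conversion step, and the padding that the pad token exists for, can be pictured roughly as follows. This is a sketch under assumptions, not the code behind `convert_to_inds`: the helpers `to_indices` and `pad_batch` and the tiny vocabulary are made up for illustration.

```python
def to_indices(tokens, word2ind, sos=False, eos=False):
    """Map tokens to indices; unseen words fall back to <UNK>."""
    unk = word2ind["<UNK>"]
    inds = [word2ind.get(tok, unk) for tok in tokens]
    if sos:
        inds = [word2ind["<SOS>"]] + inds
    if eos:
        inds = inds + [word2ind["<EOS>"]]
    return inds


def pad_batch(batch, pad_ind):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_ind] * (max_len - len(seq)) for seq in batch]


# toy vocabulary, using the same special tokens as above.
word2ind = {"<EOS>": 0, "<SOS>": 1, "<PAD>": 2, "<UNK>": 3,
            "supreme": 4, "court": 5, "ruling": 6}
summaries = [["supreme", "court", "ruling"], ["court", "reactions"]]

converted = [to_indices(s, word2ind, sos=True, eos=True) for s in summaries]
padded = pad_batch(converted, word2ind["<PAD>"])
print(padded)
```

Padding is applied per batch rather than once for the whole dataset, so each batch only needs to be as long as its longest sequence.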


## The model

Now we can build and train our model. First we define the hyperparameters we want to use. Then we create our Summarizer and call .build_graph(), which, as the name suggests, builds the computation graph.
Then we can train the model using .train().

After training we can try out our model using .infer().

### Training
Unfortunately I do not have the resources to find the perfect (or right) hyperparameters, but these do pretty well.

I trained the model for about 40 epochs. The training loss and the validation loss were both still declining.
I chose to use 90% of the data as the training set and 10% as the validation set. We could also have used sklearn's train_test_split here.
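
A 90/10 split like the one described can be sketched with the standard library alone. How Summarizer.py actually splits the data is not shown in this notebook, so the function `train_validation_split`, the shuffling and the seed are assumptions.

```python
import random


def select(seq, idx):
    """Pick the items of seq at the given positions."""
    return [seq[i] for i in idx]


def train_validation_split(texts, summaries, validation_fraction=0.1, seed=42):
    """Hypothetical helper: shuffle once, then cut off the validation part."""
    assert len(texts) == len(summaries)
    order = list(range(len(texts)))
    random.Random(seed).shuffle(order)

    n_val = int(round(len(order) * validation_fraction))
    val_idx, train_idx = order[:n_val], order[n_val:]
    # texts and summaries are selected with the same indices,
    # so every text stays paired with its own summary.
    return (select(texts, train_idx), select(summaries, train_idx),
            select(texts, val_idx), select(summaries, val_idx))


# toy data: 20 "converted" texts and their summaries.
texts = [[i, i + 1] for i in range(20)]
summaries = [[i] for i in range(20)]
train_x, train_y, val_x, val_y = train_validation_split(texts, summaries)
print(len(train_x), len(val_x))
```

With sklearn available, `train_test_split(texts, summaries, test_size=0.1)` from `sklearn.model_selection` gives an equivalent split.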

In [0]:
# model hyperparameters
num_layers_encoder = 4
num_layers_decoder = 4
rnn_size_encoder = 300
rnn_size_decoder = 300

batch_size = 32
epochs = 100
clip = 5
keep_probability = 0.8
learning_rate = 0.0005
max_lr=0.005
learning_rate_decay_steps = 100
learning_rate_decay = 0.90


pretrained_embeddings_path = './tf_hub_embedding_headlines.npy'
summary_dir = os.path.join('./tensorboard/headlines')

use_cyclic_lr = True
inference_targets=True
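
To get a feeling for `learning_rate_decay_steps` and `learning_rate_decay`, the sketch below evaluates a standard exponential decay schedule with the values from above. This is an assumption about how the two parameters interact: Summarizer.py is not shown here, and with `use_cyclic_lr = True` the model may instead cycle between `learning_rate` and `max_lr`.

```python
def exponential_decay(initial_lr, step, decay_steps, decay_rate):
    """Standard (non-staircase) exponential decay of the learning rate."""
    return initial_lr * decay_rate ** (step / decay_steps)


# assumed schedule: every 100 steps the learning rate is multiplied by 0.90.
for step in (0, 100, 500, 1000):
    lr = exponential_decay(0.0005, step, decay_steps=100, decay_rate=0.90)
    print(step, lr)
```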


In [None]:
# build graph and train the model 
summarizer_model_utils.reset_graph()
summarizer = Summarizer.Summarizer(word2ind,
                                   ind2word,
                                   save_path='./models/headlines/my_model',
                                   mode='TRAIN',
                                   num_layers_encoder = num_layers_encoder,
                                   num_layers_decoder = num_layers_decoder,
                                   rnn_size_encoder = rnn_size_encoder,
                                   rnn_size_decoder = rnn_size_decoder,
                                   batch_size = batch_size,
                                   clip = clip,
                                   keep_probability = keep_probability,
                                   learning_rate = learning_rate,
                                   max_lr=max_lr,
                                   learning_rate_decay_steps = learning_rate_decay_steps,
                                   learning_rate_decay = learning_rate_decay,
                                   epochs = epochs,
                                   pretrained_embeddings_path = pretrained_embeddings_path,
                                   use_cyclic_lr = use_cyclic_lr,)
#                                    summary_dir = summary_dir)           

summarizer.build_graph()
summarizer.train(converted_texts, 
                 converted_summaries)


### Inference
Now we can use our trained model to create summaries. Here we are clearly overfitting, as we only trained on about 700 examples and run inference on texts from that same data, so the results below say little about how well the model generalizes.


In [323]:
summarizer_model_utils.reset_graph()
summarizer = Summarizer.Summarizer(word2ind,
                                   ind2word,
                                   './models/headlines/my_model',
                                   'INFER',
                                   num_layers_encoder = num_layers_encoder,
                                   num_layers_decoder = num_layers_decoder,
                                   batch_size = len(converted_texts[:50]),
                                   clip = clip,
                                   keep_probability = 1.0,
                                   learning_rate = 0.0,
                                   beam_width = 5,
                                   rnn_size_encoder = rnn_size_encoder,
                                   rnn_size_decoder = rnn_size_decoder,
                                   inference_targets = False,
                                   pretrained_embeddings_path = pretrained_embeddings_path)

summarizer.build_graph()
preds = summarizer.infer(converted_texts[:50],
                         restore_path =  './models/headlines/my_model',
                         targets = converted_summaries[:50])




Loaded pretrained embeddings.
Graph built.
INFO:tensorflow:Restoring parameters from ./models/headlines/my_model
Done.


In [324]:
# show results
summarizer_model_utils.sample_results(preds,
                                      ind2word,
                                      word2ind,
                                      converted_summaries[:50],
                                      converted_texts[:50])




 ----------------------------------------------------------------------------------------------------
Actual Text:
in a major abortion ruling monday the supreme court <UNK> down <UNK> of a texas law that would have <UNK> dozens of <UNK> to close here are reactions from all <UNK> of the <UNK>

Actual Summary:
reactions to the supreme court ruling on texas abortion law

Created Summary:
reactions to the supreme court ruling ruling texas abortion abortion





 ----------------------------------------------------------------------------------------------------
Actual Text:
christmas <UNK> at the <UNK> <UNK> through <UNK> and <UNK> <UNK> and <UNK> make wonderland out of this <UNK> <UNK> mary j <UNK> christmas in the city

Actual Summary:
<UNK> christmas <UNK>

Created Summary:
<UNK> christmas <UNK>





 ----------------------------------------------------------------------------------------------------
Actual Text:
cnn cnn opinion is curating tweets from our contributors during preside

# Conclusion

Generally I am really impressed by how well the model works.
We only used a limited amount of data, trained for a limited amount of time and used nearly random hyperparameters, and it still delivers good results.

However, we are clearly overfitting the training data, and the model does not generalize well.
Sometimes the summaries the model creates are good, sometimes bad, sometimes they are better than the original ones and sometimes they are just really funny.


Therefore it would be really interesting to scale it up and see how it performs.

To sum up, I am impressed by seq2seq models. They perform great on many different tasks, and I look forward to exploring more possible applications
(speech recognition...).