# News headline text generation model

Here, I use a LSTM neural network to create a generative model for text, which in this case is news headlines. LSTM recurrent neural networks are commonly used for text generation becuase they are able to learn the sequences of a given problem domain, and then generate entirely new and plausible sequences for that domain. 

I first develop a simple LSTM network, trained on a Kaggle dataset and data retrieved using the New York Times API, that will learn sequences of words for text generation. I then use this model to generate new sequences of words (i.e., headlines). Because LSTM networks can be slow to train, I've trained the networks in this notebook using an AWS GPU instance. 

## 1) Import the necessary libraries.


In [1]:
# MISC data science libraries
import pandas as pd
import numpy as np
import requests
import pickle
from glob import glob


import matplotlib.pyplot as plt
%matplotlib inline


# keras modules for preprocessing
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import keras.utils as ku 
# keras modules for building LSTM 
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
# keras modules for LSTM training
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.models import load_model


# set seeds for reproducability
from tensorflow import set_random_seed
from numpy.random import seed
set_random_seed(2)
seed(1)

Using TensorFlow backend.


## 2) Load data

### 1. Load the [kaggle dataset](https://www.kaggle.com/aashita/nyt-comments) of New York Times comments and headlines
The NYT dataset downloaded from Kaggle contains nine months of data, from the months January - May 2017 and January - April 2018. 

In [2]:
# explore the dataset using just one month of data

April2018_headlines = pd.read_csv('~/take_home_tests/SynapseFI/data/articles/ArticlesApril2018.csv')
April2018_headlines.head()

Unnamed: 0,articleID,articleWordCount,byline,documentType,headline,keywords,multimedia,newDesk,printPage,pubDate,sectionName,snippet,source,typeOfMaterial,webURL
0,5adf6684068401528a2aa69b,781,By JOHN BRANCH,article,Former N.F.L. Cheerleaders’ Settlement Offer: ...,"['Workplace Hazards and Violations', 'Football...",68,Sports,0,2018-04-24 17:16:49,Pro Football,"“I understand that they could meet with us, pa...",The New York Times,News,https://www.nytimes.com/2018/04/24/sports/foot...
1,5adf653f068401528a2aa697,656,By LISA FRIEDMAN,article,E.P.A. to Unveil a New Rule. Its Effect: Less ...,"['Environmental Protection Agency', 'Pruitt, S...",68,Climate,0,2018-04-24 17:11:21,Unknown,The agency plans to publish a new regulation T...,The New York Times,News,https://www.nytimes.com/2018/04/24/climate/epa...
2,5adf4626068401528a2aa628,2427,By PETE WELLS,article,"The New Noma, Explained","['Restaurants', 'Noma (Copenhagen, Restaurant)...",66,Dining,0,2018-04-24 14:58:44,Unknown,What’s it like to eat at the second incarnatio...,The New York Times,News,https://www.nytimes.com/2018/04/24/dining/noma...
3,5adf40d2068401528a2aa619,626,By JULIE HIRSCHFELD DAVIS and PETER BAKER,article,Unknown,"['Macron, Emmanuel (1977- )', 'Trump, Donald J...",68,Washington,0,2018-04-24 14:35:57,Europe,President Trump welcomed President Emmanuel Ma...,The New York Times,News,https://www.nytimes.com/2018/04/24/world/europ...
4,5adf3d64068401528a2aa60f,815,By IAN AUSTEN and DAN BILEFSKY,article,Unknown,"['Toronto, Ontario, Attack (April, 2018)', 'Mu...",68,Foreign,0,2018-04-24 14:21:21,Canada,"Alek Minassian, 25, a resident of Toronto’s Ri...",The New York Times,News,https://www.nytimes.com/2018/04/24/world/canad...


In [10]:
# load kaggle data

curr_dir = '/Users/laurenfinkelstein/take_home_tests/SynapseFI/data/articles/'

kaggle_headlines = []
headline_counts = []

for file in glob(curr_dir + '*.csv'):
    article_df = pd.read_csv(file)
    kaggle_headlines.extend(list(article_df.headline.values))
    headline_counts.append(len(article_df.headline))

kaggle_headlines = [h for h in kaggle_headlines if h != "Unknown"]
len(kaggle_headlines)

8603

In [11]:
# calculate number of headlines in an average month in the Kaggle dataset

mean_headlines = round(np.mean(headline_counts))
print('Mean number of headlines in each month of Kaggle data:', mean_headlines)

Mean number of headlines in each month of Kaggle data: 1037.0


### 2. Augment the data obtained from Kaggle with more data using the New York Times API

The New York Times headlines data downloaded from Kaggle includes data from the months of January - May 2017 and January - April 2018 (and hence is missing data from May 2018). To compensate for this missing month of data, I downloaded data directly using the New York Times Archive API from May 2018. 

In [12]:
def get_NYT_headlines(year, month):
    
    '''
     This function makes a request to the New York Times Archive API and collects 
     a list of article headlines for the specified month and year of interest.
    '''

    api_key = {'api-key' : pickle.load(open('apikey.pkl','rb'))}
    url = 'https://api.nytimes.com/svc/archive/v1/' + str(year) + '/' + str(month) + '.json'
    
    response = requests.get(url, params=api_key)
    output = response.json()
    
    docs = output['response']['docs']
    
    headlines = []
    for doc in docs:
        headlines.append(doc['headline']['main'])
    
    return headlines

In [13]:
may_2018_headlines = get_NYT_headlines(2018, 5)
may_2018_headlines[:10]
print ('Number of headlines in the NYT May 2018 Archive:', len(may_2018_headlines))

Number of headlines in the NYT May 2018 Archive: 7421


There are many more headlines for May 2018 from the NYT Archive than for a given month from the Kaggle dataset. As to not bias the training data from Kaggle's dataset towards the NYT Archive May 2018 data, I will randomly sample a quantity of May 2018 headlines equal to the number of headlines in an average month of the Kaggle dataset. 

In [14]:
index = np.random.choice(range(1,len(may_2018_headlines)), int(mean_headlines))
may_2018_rand = [may_2018_headlines[i] for i in index]
may_2018_rand[:5]

['With ‘Kudos,’ Rachel Cusk Completes an Exceptional Trilogy',
 '’80s Beauty Products That Are Still Beloved Today',
 'How Tech Can Turn Doctors Into Clerical Workers',
 'Every 202,500 Years, Earth Wanders in a New Direction',
 'How the N.R.A. Fought Gun Control After Parkland']

In [15]:
print("Number of May 2018 headlines retrieved from NYT archive:", len(may_2018_rand))

Number of May 2018 headlines retrieved from NYT archive: 1037


### 3. Merge Kaggle data with NYT Archive data

We now need to merge the data retrieved from the NYT Archive with the data downloaded from Kaggle. 

In [16]:
# merge Kaggle data with NYT Archive data

all_headlines = kaggle_headlines + may_2018_rand
len(all_headlines)

9640

Now that we've merged the headlines downloaded from Kaggle and the headlines downloaded using the NYT API together, we should check for duplicates and remove those from our list of headlines we will use to train the model. 

In [17]:
# check for duplicates

len(all_headlines) == len(set(all_headlines))
print("Number of duplicate headlines:", len(all_headlines) - len(set(all_headlines)))

Number of duplicate headlines: 449


In [18]:
# remove duplicates in all_headlines

seen = set()
all_headlines_uniq = []
for headline in all_headlines:
    if headline not in seen:
        all_headlines_uniq.append(headline)
        seen.add(headline)

In [19]:
# great, no more duplicate headlines

len(all_headlines_uniq) == len(set(all_headlines_uniq))

True

In [20]:
print("Number of headlines in the training dataset:", len(all_headlines_uniq))

Number of headlines in the training dataset: 9191


## 3) Preprocess the data

### Clean and tokenize the data

Preprocessing the data consists of both cleaning and tokenizing the data. 

To clean the data, we will remove punctuation and lowercase all words in the corpus. We do not need to worry about removing stop words, as we do in many NLP projects, because we want to the model to generate fluid headlines similar to those that would be created by a human. 

To prepare the data for modeling by the neural network, we must convert words to integers (i.e., tokenize the data), as the words cannot be modeled directly. Tokenization is the process of extracting tokens (i.e., terms/words) from a corpus. 

We will clean and tokenize the corpus in the same step, using the Keras library's Tokenizer method. The tokenizer removes all punctuation including tabs and linebreaks, and converts text to lowercase. It then tokenizes/vectorizes all text in the corpus by turning each text into a sequence of integers/tokens. This step is important, as language modelling takes in sequential data (i.e., words/tokens), and uses this sequential data to predict the next word/token. 

### Generate sequence of n-gram tokens

Once we've cleaned and tokenized the corpus, we will generate sequences of n-gram tokens that will be used to train the model. Each integer in these n-grams is the index of the given word in the complete vocabulary of words present in the corpus of text.

For example, one of the headlines in the corpus is "Finding an expansive view of a forgotten people in niger". For this headline, the **sequences of n-gram tokens** will look like this: 

    [169, 17],
    [169, 17, 665],
    [169, 17, 665, 367],
    [169, 17, 665, 367, 4],
    [169, 17, 665, 367, 4, 2],
    [169, 17, 665, 367, 4, 2, 666],
    [169, 17, 665, 367, 4, 2, 666, 170],
    [169, 17, 665, 367, 4, 2, 666, 170, 5],
    [169, 17, 665, 367, 4, 2, 666, 170, 5, 667]

which, respectively, equate to the following **n-grams of text**: 

    Finding an,
    Finding an expansive,
    Finding an expansive view,
    Finding an expansive view of,
    Finding an expansive view of a,
    Finding an expansive view of a forgotten,
    Finding an expansive view of a forgotten people,
    Finding an expansive view of a forgotten people in,
    Finding an expansive view of a forgotten people in niger

In [21]:
corpus = all_headlines_uniq

# Clean and tokenize using Keras' built-in Tokenizer() method
tokenizer = Tokenizer() # create the tokenizer
tokenizer.fit_on_texts(corpus) # fit the tokenizer on the documents

# convert data to sequence of tokens
input_sequences = [] # sequences of n-grams from all documents in the corpus
for headline in corpus:
    token_list = tokenizer.texts_to_sequences([headline])[0] # list of tokens corresponding to each word in the document (i.e., news headline)
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1] # creates n-gram sequence (from 2 word n-grams, through n-grams = len(headline))
        input_sequences.append(n_gram_sequence)

print("Printing the first 5 input sequences...")
print(input_sequences[:10])

nb_samples = sum(len(s) for s in input_sequences) # total number of samples in input_sequences; 
print("\nTotal number of samples in input_sequences:", nb_samples)

# Vocab size
vocab_size = len(tokenizer.word_index) + 1
print("\nTotal number of words in the vocabulary:", vocab_size)

Printing the first 5 input sequences...
[[394, 18], [394, 18, 5626], [394, 18, 5626, 553], [394, 18, 5626, 553, 4], [394, 18, 5626, 553, 4, 2], [394, 18, 5626, 553, 4, 2, 1502], [394, 18, 5626, 553, 4, 2, 1502, 159], [394, 18, 5626, 553, 4, 2, 1502, 159, 5], [394, 18, 5626, 553, 4, 2, 1502, 159, 5, 2180], [7, 74]]

Total number of samples in input_sequences: 322255

Total number of words in the vocabulary: 12861


### Pad the sequences

Up to this point, we've generated a data-set containing a sequence of tokens. It's important to recognize that different sequences may have different lengths. Before training the model, we must ensure all sequence lengths are equal. For this, we can pad the sequences using the Keras pad_sequence function. 

In [22]:
max_sequence_len = max([len(s) for s in input_sequences]) # find length of the longest input sequence (i.e., headline)
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')) # pads the input sequences and converts to a numpy array (must fit model on np arrays)

 ### Obtain variables (predictor and target)

Now, we need to define the training data for the network. In order to input data and train the model, we must create predictors and a label. We will use a n-gram sequence as predictors, and the next word of the n-gram as the label. 

For example, for the headline "Finding an expansive view of a forgotten people in niger", the first three **predictors** will be:

    1. Finding an
    2. Finding an expansive
    3. Finding an expansive view

and the corresponding **labels** will be

    1. expansive
    2. view
    3. of

... and so on. 

In [23]:
## create predictors and label
predictors, label = input_sequences[:,:-1],input_sequences[:,-1] # label is the last word of each n-gram, and predictors are all the preceeding words
label = ku.to_categorical(label, num_classes=vocab_size) # one-hot encodes the labels; matrix dimension are the number of input sequences by the number of total words/tokens in the word dictionary

Now we can obtain the input vector *X* and the label vector _Y_, which will be used to train the network. 

## 4) Build and train the model

We can now define our LSTM network. Let's start with a simple LSTM, and then later add more layers and increase the number of memory units and epochs. Then we can compare the accuracy, loss, and results to ultimately choose the best network. 

##### Model 1

We will start with a network consisting of a three layer archiecture, including:
       
***Input (Embedding) Layer*** : Takes the sequence of words as input, represented as dense vectors
<br /> ***LSTM Layer*** : Hidden layer; computes the output using LSTM units (100 units in this case)
<br /> ***Output (Dense) Layer*** : Dense layer; computes the probability of the best possible next word as the output
    
We will start by running the model for 100 epochs. 

In [26]:
input_len = max_sequence_len - 1 # length of the longest sequence - the length of the predictor (length=1)

In [27]:
## Create a Keras Sequential model 
model = Sequential()

## Specify the network architecture
model.add(Embedding(vocab_size, 50, input_length=input_len))
model.add(LSTM(100))
model.add(Dense(vocab_size, activation='softmax'))

In [28]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 27, 50)            643050    
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               60400     
_________________________________________________________________
dense_2 (Dense)              (None, 12861)             1298961   
Total params: 2,002,411
Trainable params: 2,002,411
Non-trainable params: 0
_________________________________________________________________


In [29]:
# Compile the network (sets up training parameters before training)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

Now that we've created the model architecture, we can train it using our data (predictors and labels).

Because the network is slow to train, we can use model checkpointing to record all the network weights anytime an improvement in loss occurs at the end of an epoch. We can then use the best set of weights (i.e., the lowest loss) to later instantiate the text generative model for use. 

In [30]:
# define the checkpoint
filepath="weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

In [None]:
# train the model

model1_hist = model.fit(predictors, label, epochs=100, verbose=1, callbacks=callbacks_list)

In [None]:
pickle.dump(tokenizer, open('tokenizer.pkl', 'wb')) # pickle the tokenizer for use in external script
model.save('model_1.hdf5') # save the model for later use / use in external script

To speed up training, the network was trained on a AWS GPU instance. 

In [53]:
# load the model
model1 = load_model('model_1.hdf5')

In [None]:
def plot_learning_loss(model_history):
    
    '''
    This function plots the learning and loss curves for a neural network. 
    Takes model history as input. 
    '''
    
    plt.figure(figsize=(8,4))
    
    # plot learning curve
    plt.subplot(1,2,1)
    plt.plot(model_history.history['acc'])
    plt.title('Model Accuracy')
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Test'], loc='upper left')
    plt.savefig('./accuracy_curve.png')
  
    # summarize history for loss
    plt.subplot(1,2,2)
    plt.plot(model_history.history['loss'])
    plt.title('Model Loss')
    plt.ylabel('Loss')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Test'], loc='upper left')
    plt.savefig('./loss_curve.png')
    
    plt.show()

In [None]:
plot_learning_loss(model1)

## 5) Use the model to generate text

Now that the model has been trained, we can use the network to generate text. We use a seed sequence as the input and predict the next word based on input words. 

To feed this seed text into the model, we will need to do some pre-processing, which includes (1) tokenizing the seed text and (2) padding the sequences, similar to the way we preprocessed the text sequences the model was trained on. We can then pass the sequences into the trained model to get the predicted word. The predicted words can then be appended together to obtain a predicted sequence. 

In [54]:
def generate_headline(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0] # list of tokens corresponding to each word in the document (i.e., news headline)
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre') # add padding
        predicted = model.predict_classes(token_list, verbose=1) # make predictions using trained model
        
        output_word = ""
        for word,index in tokenizer.word_index.items(): # word index gives a dictionary of words and their uniquely assigned integers
            if index == predicted:
                output_word = word
                break
        seed_text += " "+output_word
    return seed_text.title()

##### Model 1 results

In [83]:
print (generate_headline("Donald Trump", 4, model1, max_sequence_len))
print (generate_headline("New York", 4, model1, max_sequence_len))
print (generate_headline("UK Brexit", 6, model1, max_sequence_len))
print (generate_headline("Vladimir Putin", 4, model1, max_sequence_len))
print (generate_headline("Opioid Crisis", 4, model1, max_sequence_len))
print (generate_headline("Barack Obama",4, model1, max_sequence_len))

Donald Trump Myth Of Women’S Stocks
New York For End And Subway
Uk Brexit Allow Class The Pay Our British
Vladimir Putin On Dies Of Needs
Opioid Crisis He Injury For Disaster
Barack Obama Of Trump’S Culture Disability


In observing the generated text from Model 1, we see that some of the sequences make sense, but not all are perfect. 

Some ideas to improve the quality of the results, by fine-tuning the network architecture and parameters, include:
- using a larger network
- increasing the input dimension in the embedding layer
- increasing the number of memory units in the LSTM layer(s)
- increasing the number of training epochs and decreasing the batch size

These improvement ideas will be experimented on in Models 2 and 3, below!

## 6) Next Steps

### Model 2

Let's try creating a larger network by adding a second LSTM layer to see if this improves the quality of the generated text. Let's start by keeping the number of memory units the same at 100, but adding a second LSTM layer. So the new model architecture becomes:

***Input (Embedding) Layer***
<br /> ***LSTM Layer***
<br /> ***LSTM Layer***
<br /> ***Output (Dense) Layer***
    

In [None]:
model2 = Sequential()
model2.add(Embedding(vocab_size, 50, input_length=input_len)) # try increasing output dim to 100
model2.add(LSTM(100, return_sequences=True))
model2.add(LSTM(100))
model2.add(Dense(vocab_size, activation='softmax'))

In [None]:
model2.summary()

In [None]:
model2.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
model2_hist = model2.fit(predictors, label, epochs=100, verbose=1, callbacks=callbacks_list)

In [None]:
model2.save('model_2.hdf5')

##### Model 2 results

In [None]:
print (generate_headline("Donald Trump", 4, model1, max_sequence_len))
print (generate_headline("New York", 4, model1, max_sequence_len))
print (generate_headline("UK Brexit", 6, model1, max_sequence_len))
print (generate_headline("Vladimir Putin", 4, model1, max_sequence_len))
print (generate_headline("Opioid Crisis", 4, model1, max_sequence_len))
print (generate_headline("Barack Obama",4, model1, max_sequence_len))

### Model 3

Let's use the same architecture as model 2, but try increasing the output dimension in the embedding layer to 100 and increasing number of memory units in the LSTM layers to 200. 

In [None]:
model3 = Sequential()
model3.add(Embedding(vocab_size, 100, input_length=input_len))
model3.add(LSTM(200, return_sequences=True))
model3.add(LSTM(200))
model3.add(Dense(vocab_size, activation='softmax'))

In [None]:
model3.summary()

In [None]:
model3.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

We can also try increasing the number of training epochs from 100 to 150 and decreasing the batch size from unspecified (i.e., 32) to 16. This affords the network a greater opportunity to learn and update its weights. 

In [None]:
model3_hist = model3.fit(predictors, label, epochs=150, batch_size=16, verbose=1, callbacks=callbacks_list)

In [None]:
model3.save('model_3.hdf5')

##### Model 3 results

In [None]:
print (generate_headline("Donald Trump", 4, model1, max_sequence_len))
print (generate_headline("New York", 4, model1, max_sequence_len))
print (generate_headline("UK Brexit", 6, model1, max_sequence_len))
print (generate_headline("Vladimir Putin", 4, model1, max_sequence_len))
print (generate_headline("Opioid Crisis", 4, model1, max_sequence_len))
print (generate_headline("Barack Obama",4, model1, max_sequence_len))