## Headline Generator

Purpose: to train a model to predict the next word in a headline, and to use that model to generate headlines of various lengths.

### Reading in the Data
The dataset consists of headlines from the New York Times newspaper over the course of several months. First, read in all the headlines from the articles.

In [2]:
import opendatasets as od
dataset = 'https://www.kaggle.com/datasets/aashita/nyt-comments'
od.download(dataset)

Skipping, found downloaded files in "./nyt-comments" (use force=True to force download)


In [3]:
import os
nyt_dir = './nyt-comments'  # same folder that od.download reported above
os.listdir(nyt_dir)

['ArticlesFeb2017.csv',
 'CommentsFeb2018.csv',
 'ArticlesApril2017.csv',
 'CommentsApril2018.csv',
 'ArticlesMarch2018.csv',
 'CommentsMarch2017.csv',
 'ArticlesMay2017.csv',
 'ArticlesJan2017.csv',
 'CommentsJan2018.csv',
 'CommentsMarch2018.csv',
 'ArticlesJan2018.csv',
 'CommentsMay2017.csv',
 'CommentsJan2017.csv',
 'ArticlesMarch2017.csv',
 'CommentsApril2017.csv',
 'ArticlesFeb2018.csv',
 'CommentsFeb2017.csv',
 'ArticlesApril2018.csv']

In [4]:
import pandas as pd

all_headlines = []  # list to hold the headlines from every file

for filename in os.listdir(nyt_dir):
    if 'Articles' in filename:
        # read in all relevant data from the CSV file
        headlines_df = pd.read_csv(os.path.join(nyt_dir, filename))
        # add all headlines to the list
        all_headlines.extend(list(headlines_df.headline.values))

len(all_headlines)

9335

In [5]:
all_headlines[:20]

['N.F.L. vs. Politics Has Been Battle All Season Long',
 'Voice. Vice. Veracity.',
 'A Stand-Up’s Downward Slide',
 'New York Today: A Groundhog Has Her Day',
 'A Swimmer’s Communion With the Ocean',
 'Trail Activity',
 'Super Bowl',
 'Trump’s Mexican Shakedown',
 'Pence’s Presidential Pet',
 'Fruit of a Poison Tree',
 'The Peculiar Populism of Donald Trump',
 'Questions for: ‘On Alaska’s Coldest Days, a Village Draws Close for Warmth’',
 'The New Kids',
 'What My Chinese Mother Made',
 'Do You Think Teenagers Can Make a Difference in the World?',
 'Unknown',
 'President Pledges to Let Politics Return to Pulpits',
 'The Police Killed My Unarmed Son in 2012. I’m Still Waiting for Justice.',
 'Video of Sheep Slaughtering Ignites a Dispute',
 'This Will Change Your Mind']

### Cleaning the Data
Since this is a natural language processing (NLP) task, we need to convert the text into a form a computer can work with. We do this with tokenization: representing each word with a number.

In [6]:
# remove all headlines with the value of "Unknown"
all_headlines = [h for h in all_headlines if h != "Unknown"]
len(all_headlines)

8603

In [7]:
all_headlines[:20]

['N.F.L. vs. Politics Has Been Battle All Season Long',
 'Voice. Vice. Veracity.',
 'A Stand-Up’s Downward Slide',
 'New York Today: A Groundhog Has Her Day',
 'A Swimmer’s Communion With the Ocean',
 'Trail Activity',
 'Super Bowl',
 'Trump’s Mexican Shakedown',
 'Pence’s Presidential Pet',
 'Fruit of a Poison Tree',
 'The Peculiar Populism of Donald Trump',
 'Questions for: ‘On Alaska’s Coldest Days, a Village Draws Close for Warmth’',
 'The New Kids',
 'What My Chinese Mother Made',
 'Do You Think Teenagers Can Make a Difference in the World?',
 'President Pledges to Let Politics Return to Pulpits',
 'The Police Killed My Unarmed Son in 2012. I’m Still Waiting for Justice.',
 'Video of Sheep Slaughtering Ignites a Dispute',
 'This Will Change Your Mind',
 'Busy Start for a President, and That Was in 1933']


We also want to remove punctuation and make our sentences lowercase. This decreases the number of unique words/tokens, which makes the model easier to train. The Keras Tokenizer filters punctuation and lowercases the text by default.

### Tokenization
With tokenization, a piece of text is separated into smaller chunks - in this case, words. Each unique word is assigned a number.

In [8]:
from tensorflow.keras.preprocessing.text import Tokenizer

# tokenize the words in the headlines
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_headlines)
total_words = len(tokenizer.word_index) + 1
print('Total words: ', total_words)

Total words:  11753
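
As a rough illustration of what the Tokenizer does with its default settings, the sketch below lowercases the text, replaces punctuation with spaces, splits on whitespace, and numbers the words from 1 in order of frequency. It is a simplified stand-in, not the Keras implementation: FILTERS is assumed to mirror the Tokenizer's default filters string, and simple_tokenize and build_word_index are hypothetical helper names.

```python
from collections import Counter

# Assumption: this mirrors the default `filters` string of the Keras Tokenizer
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_tokenize(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: ' ' for ch in FILTERS})
    return text.lower().translate(table).split()


def build_word_index(texts):
    """Assign 1-based indices to words, most frequent word first."""
    counts = Counter(word for text in texts for word in simple_tokenize(text))
    # sorted() is stable, so words with equal counts keep first-seen order
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: index for index, (word, _) in enumerate(ranked, start=1)}


sample = ['N.F.L. vs. Politics', 'Politics: A New Day']
word_index = build_word_index(sample)
print(simple_tokenize(sample[0]))
print(word_index)
```

Because the periods are replaced with spaces, "N.F.L." becomes the three separate tokens "n", "f", and "l". Index 0 is never assigned to a word, which is why total_words adds 1 to the size of word_index: it leaves room for the padding value used later.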


We can use the word_index dictionary to see which number the tokenizer assigned to each word.

In [9]:
# print a subset of the word_index dictionary created by Tokenizer
subset_dict = {key: value for key, value in tokenizer.word_index.items()
               if key in ['a', 'man', 'a', 'plan', 'a', 'canal', 'panama']}
print(subset_dict)  # print a subset to visualize the tokenization

{'a': 2, 'plan': 82, 'man': 139, 'panama': 2931, 'canal': 5487}


We can use the texts_to_sequences method to convert text into the corresponding token numbers.

In [10]:
tokenizer.texts_to_sequences(['a', 'man', 'a', 'plan', 'a', 'canal', 'panama'])

[[2], [139], [2], [82], [2], [5487], [2931]]

### Creating Sequences
Now that the data is tokenized, we will need to create a sequence of tokens from the headlines. These sequences will be used to train the deep learning model.

In [11]:
# convert data to sequence of tokens
input_sequences = []
for line in all_headlines:
    # convert the headline into a sequence of tokens
    token_list = tokenizer.texts_to_sequences([line])[0]

    # create a series of sequences for each headline
    for i in range(1, len(token_list)):
        partial_sequence = token_list[:i+1]
        input_sequences.append(partial_sequence)

print(tokenizer.sequences_to_texts(input_sequences[:5]))
input_sequences[:5]

['n f', 'n f l', 'n f l vs', 'n f l vs politics', 'n f l vs politics has']


[[193, 125],
 [193, 125, 253],
 [193, 125, 253, 157],
 [193, 125, 253, 157, 226],
 [193, 125, 253, 157, 226, 83]]

### Padding Sequences
These sequences are of various lengths. For the model to train on the data, all the sequences need to be the same length. This can be done by padding the sequences with the Keras pad_sequences function.

In [12]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# determine max sequence length
max_sequence_len = max([len(x) for x in input_sequences])

# pad all sequences with zeros at the beginning to make them all max length
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
input_sequences[0]

array([  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
       193, 125], dtype=int32)
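
To make the effect of padding='pre' concrete, here is a minimal pure-Python sketch of the same idea. pad_pre is a hypothetical helper written for illustration; it is not how pad_sequences is implemented, and it assumes maxlen is at least 1.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` so every result has length `maxlen`.

    Sequences longer than `maxlen` keep only their last `maxlen` tokens,
    which matches the default truncating='pre' behavior of pad_sequences.
    """
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded


toy_sequences = [[193, 125], [193, 125, 253], [1, 2, 3, 4, 5]]
for row in pad_pre(toy_sequences, maxlen=4):
    print(row)
```

Padding at the front keeps the real tokens at the end of every row, so the last column always holds the word to be predicted.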

### Creating Predictors and Target
We need to split up the sequences into predictors and a target. The last word of the sequence will be the target, and the first words of the sequence will be the predictors. For example, in "nvidia releases ampere graphics cards", the predictors are "nvidia releases ampere graphics" and the target/label is "cards".

In [13]:
# predictors are every word except the last
predictors = input_sequences[:,:-1]
# labels are the last word
labels = input_sequences[:,-1]
labels[:5]

array([125, 253, 157, 226,  83], dtype=int32)
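
The same split can be traced by hand on the example sentence. This sketch uses plain words instead of token numbers and skips padding, purely for readability; the variable names are illustrative.

```python
headline = "nvidia releases ampere graphics cards"
tokens = headline.split()

# Every prefix of length 2 or more becomes one training sequence
prefixes = [tokens[:i + 1] for i in range(1, len(tokens))]

# The last word of each prefix is the target; the rest are the predictors
pairs = [(prefix[:-1], prefix[-1]) for prefix in prefixes]

for predictors_words, target_word in pairs:
    print(predictors_words, '->', target_word)
```

A five-word headline therefore yields four training examples, one for each position after the first word.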

The targets are categorical: we are predicting one word out of the total vocabulary. Instead of having the model predict a scalar token number, we convert each label to a one-hot vector, with a 1 in the position of the target word and 0 everywhere else.

In [14]:
from tensorflow.keras import utils

labels = utils.to_categorical(labels, num_classes=total_words)
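
For intuition, one-hot encoding can be written out in a few lines of plain Python. This is a sketch with a toy vocabulary of 6 tokens; one_hot is a hypothetical helper, not the Keras function.

```python
def one_hot(label, num_classes):
    """Return a list of zeros with a single 1.0 at position `label`."""
    if not 0 <= label < num_classes:
        raise ValueError("label must be in the range [0, num_classes)")
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector


toy_vocab_size = 6
encoded = [one_hot(label, toy_vocab_size) for label in [1, 3, 5]]
for row in encoded:
    print(row)
```

With total_words = 11753, each label in this notebook becomes a vector of 11,753 values with a single nonzero entry. If memory becomes a concern, Keras also provides a sparse_categorical_crossentropy loss that accepts the integer labels directly; that is an alternative to the approach taken here, not part of it.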

### Creating the Model
Embedding Layer: This layer takes the tokenized sequences and learns an embedding for every word in the training dataset. Its goal is to reduce the number of feature dimensions: each word is represented as a short dense vector (10 values here), and the values in that vector encode relationships between words.

Long Short Term Memory (LSTM) Layer: This is a type of recurrent neural network (RNN). In a traditional RNN, information from recent time steps contributes more than information from further back in the sequence. LSTMs are a type of recurrent layer that can learn and retain longer-term information.

In [15]:
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.models import Sequential

# input is max length -1, as we've removed the last word for the label
input_len = max_sequence_len - 1

model = Sequential()

# add input embedding layer
model.add(Embedding(total_words, 10, input_length=input_len))

# add LSTM layer with 100 units
model.add(LSTM(100))
model.add(Dropout(0.1))

# add output layer
model.add(Dense(total_words, activation='softmax'))



In [16]:
model.summary()
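
The parameter counts that a summary of this architecture should report can be worked out by hand from the layer sizes. The arithmetic below is a sketch based on the standard formulas for these layer types and the values used in this notebook; it is not output copied from the notebook.

```python
vocab_size = 11753    # total_words reported by the tokenizer cell
embedding_dim = 10    # second argument to Embedding
lstm_units = 100      # argument to LSTM

# One vector of length embedding_dim per token in the vocabulary
embedding_params = vocab_size * embedding_dim

# An LSTM has four weight sets (three gates plus the candidate cell state);
# each has an input kernel, a recurrent kernel, and a bias
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# Dropout has no trainable parameters
dropout_params = 0

# One weight per (LSTM unit, vocabulary word) pair, plus one bias per word
dense_params = lstm_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dropout_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Most of the parameters sit in the output layer, because it has to score every word in the vocabulary.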

### Compiling and Training the Model
As we are categorically predicting one word from our total vocabulary, we will use categorical crossentropy as the loss function. Note: we do not track accuracy as a metric, because text prediction is not more or less accurate in the same way as image classification: several different next words can be equally reasonable.

We will use the Adam optimizer, which is a common choice for training LSTMs.

In [17]:
model.compile(loss='categorical_crossentropy', optimizer='adam')
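
A reference point helps when reading the loss values that training reports. For one sample, categorical crossentropy is the negative natural log of the probability the model assigned to the correct word, so a model that guesses uniformly over the vocabulary scores about 9.37. The sketch below works through that arithmetic and converts the loss reported for epoch 9 in this notebook's training log (5.7629) into a perplexity; the variable names are illustrative.

```python
import math

vocab_size = 11753  # total_words reported by the tokenizer cell


def cross_entropy(prob_of_true_word):
    """Loss for one sample: -log of the probability given to the correct word."""
    return -math.log(prob_of_true_word)


# Baseline: every word in the vocabulary is treated as equally likely
uniform_loss = cross_entropy(1.0 / vocab_size)
print(f"uniform-guess loss: {uniform_loss:.2f}")

# exp(loss) is the perplexity: the effective number of equally likely choices
epoch_9_loss = 5.7629
perplexity = math.exp(epoch_9_loss)
print(f"perplexity at epoch 9: {perplexity:.0f}")
```

By this measure the model has narrowed its choice from the full vocabulary to a few hundred candidate words per position, which is progress but still far from confident prediction.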

In [18]:
model.fit(predictors, labels, epochs=10, verbose=1)

Epoch 1/10
[1m1666/1666[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m20s[0m 12ms/step - loss: 8.0648
Epoch 2/10
[1m1666/1666[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m18s[0m 11ms/step - loss: 7.4853
Epoch 3/10
[1m1666/1666[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m17s[0m 10ms/step - loss: 7.2626
Epoch 4/10
[1m1666/1666[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m18s[0m 11ms/step - loss: 7.0074
Epoch 5/10
[1m1666/1666[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m18s[0m 11ms/step - loss: 6.7484
Epoch 6/10
[1m1666/1666[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m18s[0m 11ms/step - loss: 6.4922
Epoch 7/10
[1m1666/1666[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m18s[0m 11ms/step - loss: 6.2163
Epoch 8/10
[1m1666/1666[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m19s[0m 12ms/step - loss: 5.9977
Epoch 9/10
[1m1666/1666[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m19s[0m 11ms/step - loss: 5.7629
Epoch 10/10
[1m1666/1666[0m [32m━━━━━━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x2b395f7a0>

### Making Predictions
To make predictions, we need to start with a seed text and prepare it in the same way that we prepared our dataset (tokenizing and padding). We can create a function to do this.

In [32]:
def predict_next_token(seed_text):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]  # convert to tokens
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')  # pad the lists of tokens before entering into model
    prediction = model.predict(token_list, verbose=0)  # run the input through the model for prediction
    predicted_classes = np.argmax(prediction, axis=1)  # index of the most probable token
    return predicted_classes

In [33]:
prediction = predict_next_token("today in new york") 
prediction

array([7])

In [34]:
tokenizer.sequences_to_texts([prediction])

['and']

### Generate New Headlines
Now that we have a function that predicts the next word from a seed text, we can call it repeatedly to generate headlines several words long.

In [35]:
def generate_headline(seed_text, next_words=1):
    for _ in range(next_words):
        # predict next token
        prediction = predict_next_token(seed_text)
        # convert token to word
        next_word = tokenizer.sequences_to_texts([prediction])[0]
        # add next word to the headline. This headline will be used in the next pass of the loop.
        seed_text += " " + next_word
    # return headline as title-case
    return seed_text.title()

In [36]:
seed_texts = [
    'washington dc is',
    'today in new york',
    'the school district has',
    'crime has become']
for seed in seed_texts:
    print(generate_headline(seed, next_words=5))

Washington Dc Is A New York Of A
Today In New York And The City Of A
The School District Has The New York Times For
Crime Has Become A World On The City


The results are underwhelming after 10 epochs of training. The headlines are loosely grammatical, but they lean on common phrases such as "new york" and "of a" and show little understanding of the seed text. Training for more epochs could improve the results.
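
Another factor is that predict_next_token always picks the single most probable word (np.argmax), which tends to favor very common words. A widely used alternative, which this notebook does not use, is to sample the next word from the predicted distribution, with a temperature that controls how adventurous the choice is. The sketch below shows the idea on a toy distribution; sample_with_temperature is a hypothetical helper.

```python
import math
import random


def sample_with_temperature(probs, temperature=1.0):
    """Draw an index from `probs` after rescaling by `temperature`.

    Temperatures below 1 sharpen the distribution (closer to argmax);
    temperatures above 1 flatten it (more varied choices).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Work in log space; the small constant guards against log(0)
    logits = [math.log(p + 1e-12) / temperature for p in probs]
    largest = max(logits)
    weights = [math.exp(logit - largest) for logit in logits]
    return random.choices(range(len(probs)), weights=weights, k=1)[0]


random.seed(0)
toy_probs = [0.1, 0.7, 0.2]
draws = [sample_with_temperature(toy_probs, temperature=1.0) for _ in range(1000)]
print({index: draws.count(index) for index in range(len(toy_probs))})
```

In predict_next_token, a draw like this over prediction[0] could take the place of the np.argmax call, at the cost of making the generated headlines different on every run.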

We could also improve the results by using pre-trained embeddings with Word2Vec or GloVe, rather than learning them during training with the Keras Embedding layer.
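
As a sketch of how pre-trained vectors could be lined up with this notebook's vocabulary, the function below builds an embedding matrix from lines in the GloVe text format (a word followed by its vector values, separated by spaces). The function name and the toy data are hypothetical; a real run would read the lines from a downloaded GloVe file and pass in tokenizer.word_index.

```python
def build_embedding_matrix(glove_lines, word_index, dim):
    """Return a (len(word_index) + 1) x dim matrix as a list of lists.

    Row i holds the pre-trained vector for the word with token number i.
    Row 0 (padding) and words missing from the GloVe data stay all zeros.
    """
    vectors = {}
    for line in glove_lines:
        parts = line.rstrip().split(' ')
        vectors[parts[0]] = [float(value) for value in parts[1:]]

    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, index in word_index.items():
        vector = vectors.get(word)
        if vector is not None and len(vector) == dim:
            matrix[index] = vector
    return matrix


# Toy stand-ins for a GloVe file and for tokenizer.word_index
toy_glove_lines = ["the 0.1 0.2 0.3", "city 0.4 0.5 0.6"]
toy_word_index = {'the': 1, 'city': 2, 'groundhog': 3}
embedding_matrix = build_embedding_matrix(toy_glove_lines, toy_word_index, dim=3)
for row in embedding_matrix:
    print(row)
```

The resulting matrix could then initialize the Embedding layer, for example through its embeddings_initializer argument with a Constant initializer and trainable=False. The embedding size (10 in this notebook) would need to change to match the dimension of the pre-trained vectors.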

It's important to note that NLP has moved beyond simple LSTM models to Transformer-based pre-trained models, which are able to learn language context from huge amounts of textual data. These pre-trained models are then used as a starting point for transfer learning to solve NLP tasks.