## Headline Generator

Purpose: train a model to predict the next word in a headline, then use that model to generate headlines of various lengths.

### Reading in the Data
The dataset consists of headlines from the New York Times newspaper over the course of several months. First, read in all the headlines from the articles.

In [2]:
import opendatasets as od
dataset = 'https://www.kaggle.com/datasets/aashita/nyt-comments'
od.download(dataset)

Dataset URL: https://www.kaggle.com/datasets/aashita/nyt-comments
Downloading nyt-comments.zip to ./nyt-comments


100%|████████████████████████████████████████| 480M/480M [00:19<00:00, 26.4MB/s]





In [4]:
import os
# od.download() above extracted the archive into ./nyt-comments
nyt_dir = './nyt-comments'
os.listdir(nyt_dir)

['ArticlesFeb2017.csv',
 'CommentsFeb2018.csv',
 'ArticlesApril2017.csv',
 'CommentsApril2018.csv',
 'ArticlesMarch2018.csv',
 'CommentsMarch2017.csv',
 'ArticlesMay2017.csv',
 'ArticlesJan2017.csv',
 'CommentsJan2018.csv',
 'CommentsMarch2018.csv',
 'ArticlesJan2018.csv',
 'CommentsMay2017.csv',
 'CommentsJan2017.csv',
 'ArticlesMarch2017.csv',
 'CommentsApril2017.csv',
 'ArticlesFeb2018.csv',
 'CommentsFeb2017.csv',
 'ArticlesApril2018.csv']

In [7]:
import pandas as pd

all_headlines = []  # list to hold every headline

for filename in os.listdir(nyt_dir):
    if 'Articles' in filename:
        # read the article metadata from the CSV file
        headlines_df = pd.read_csv(os.path.join(nyt_dir, filename))
        # add this file's headlines to the list
        all_headlines.extend(headlines_df['headline'].tolist())

len(all_headlines)

9335
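
The file-filtering pattern can be tried without downloading anything. The sketch below is only an illustration: the in-memory "files", their contents, and the articleID and commentID columns are hypothetical stand-ins, and it uses the standard-library csv module rather than pandas.

```python
import csv
import io

# Hypothetical in-memory stand-ins for files in the dataset directory.
# Only the "headline" column name comes from the notebook; the rest is made up.
fake_files = {
    "ArticlesJan2017.csv": "articleID,headline\n1,Super Bowl\n2,Unknown\n",
    "CommentsJan2017.csv": "commentID,commentBody\n9,Nice article\n",
    "ArticlesFeb2017.csv": "articleID,headline\n3,The New Kids\n",
}

collected = []
for name in sorted(fake_files):
    # Skip the Comments* files: only Articles* files have a headline column.
    if "Articles" not in name:
        continue
    reader = csv.DictReader(io.StringIO(fake_files[name]))
    collected.extend(row["headline"] for row in reader)

print(collected)  # ['The New Kids', 'Super Bowl', 'Unknown']
```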

In [8]:
all_headlines[:20]

['N.F.L. vs. Politics Has Been Battle All Season Long',
 'Voice. Vice. Veracity.',
 'A Stand-Up’s Downward Slide',
 'New York Today: A Groundhog Has Her Day',
 'A Swimmer’s Communion With the Ocean',
 'Trail Activity',
 'Super Bowl',
 'Trump’s Mexican Shakedown',
 'Pence’s Presidential Pet',
 'Fruit of a Poison Tree',
 'The Peculiar Populism of Donald Trump',
 'Questions for: ‘On Alaska’s Coldest Days, a Village Draws Close for Warmth’',
 'The New Kids',
 'What My Chinese Mother Made',
 'Do You Think Teenagers Can Make a Difference in the World?',
 'Unknown',
 'President Pledges to Let Politics Return to Pulpits',
 'The Police Killed My Unarmed Son in 2012. I’m Still Waiting for Justice.',
 'Video of Sheep Slaughtering Ignites a Dispute',
 'This Will Change Your Mind']

### Cleaning the Data
Since this is a natural language processing (NLP) task, we need to convert the text into a form that a computer can work with. We do this through tokenization: representing each word with a number.

In [9]:
# remove all headlines with the value of "Unknown"
all_headlines = [h for h in all_headlines if h != "Unknown"]
len(all_headlines)

8603

In [10]:
all_headlines[:20]

['N.F.L. vs. Politics Has Been Battle All Season Long',
 'Voice. Vice. Veracity.',
 'A Stand-Up’s Downward Slide',
 'New York Today: A Groundhog Has Her Day',
 'A Swimmer’s Communion With the Ocean',
 'Trail Activity',
 'Super Bowl',
 'Trump’s Mexican Shakedown',
 'Pence’s Presidential Pet',
 'Fruit of a Poison Tree',
 'The Peculiar Populism of Donald Trump',
 'Questions for: ‘On Alaska’s Coldest Days, a Village Draws Close for Warmth’',
 'The New Kids',
 'What My Chinese Mother Made',
 'Do You Think Teenagers Can Make a Difference in the World?',
 'President Pledges to Let Politics Return to Pulpits',
 'The Police Killed My Unarmed Son in 2012. I’m Still Waiting for Justice.',
 'Video of Sheep Slaughtering Ignites a Dispute',
 'This Will Change Your Mind',
 'Busy Start for a President, and That Was in 1933']


We also want to remove punctuation and make our sentences lowercase. This decreases the number of unique words/tokens, which makes the model easier to train. The Keras Tokenizer handles both steps with its default settings: it filters out punctuation and lowercases the text.
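
To make the effect of this cleaning concrete, here is a small pure-Python sketch of the same idea. It is not the Keras implementation. The filter string is our understanding of the Tokenizer's default filters argument, so treat it as an assumption and check it against your installed version. The helper name normalize is made up for this example.

```python
# Assumed to match the Keras Tokenizer's default `filters` argument.
# Note that the apostrophe is not in the list, so words like "don't" stay whole.
ASSUMED_KERAS_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def normalize(text):
    """Lowercase the text, turn filtered characters into spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in ASSUMED_KERAS_FILTERS})
    return text.lower().translate(table).split()


print(normalize("N.F.L. vs. Politics Has Been Battle All Season Long"))
# ['n', 'f', 'l', 'vs', 'politics', 'has', 'been', 'battle', 'all', 'season', 'long']
```

Because each filtered character becomes a space, an abbreviation such as "N.F.L." is split into three one-letter words.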

### Tokenization
With tokenization, a piece of text is separated into smaller chunks - in this case, words. Each unique word is assigned a number.

In [11]:
from tensorflow.keras.preprocessing.text import Tokenizer

# tokenize the words in the headlines
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_headlines)
# add 1 because word indexes start at 1; index 0 is reserved (we use it for padding later)
total_words = len(tokenizer.word_index) + 1
print('Total words: ', total_words)

Total words:  11753
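
As a rough illustration of how words get their numbers, the sketch below builds a word index in plain Python: the most frequent word gets index 1, the next gets 2, and so on. The function name build_word_index and the sample texts are made up for this example; this is a simplified stand-in, not the Tokenizer's actual code.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


sample = ["the new kids", "the super bowl", "the new plan"]
word_index = build_word_index(sample)
vocab_size = len(word_index) + 1  # +1 because index 0 is never assigned to a word

print(word_index["the"], word_index["new"], vocab_size)  # 1 2 7
```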


We can use the word_index dictionary to see which number the tokenizer assigned to each word.

In [14]:
# print a subset of the word_index dictionary created by Tokenizer
words_of_interest = {'a', 'man', 'plan', 'canal', 'panama'}
subset_dict = {key: value for key, value in tokenizer.word_index.items()
               if key in words_of_interest}
print(subset_dict)  # print a subset to visualize the tokenization

{'a': 2, 'plan': 82, 'man': 139, 'panama': 2931, 'canal': 5487}


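We can use the texts_to_sequences method to convert text into lists of token numbers. Each string in the input list is treated as a separate text, so passing single words returns one short list per word.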

In [15]:
tokenizer.texts_to_sequences(['a', 'man', 'a', 'plan', 'a', 'canal', 'panama'])

[[2], [139], [2], [82], [2], [5487], [2931]]
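
The lookup itself is simple enough to sketch in plain Python. The example below reuses the indexes from the subset printed above and converts a whole sentence at once. The helper name to_sequence is hypothetical. It skips words that are not in the index, which is our understanding of what the Tokenizer does when no out-of-vocabulary token is configured.

```python
# Indexes copied from the word_index subset printed above.
word_index = {"a": 2, "plan": 82, "man": 139, "panama": 2931, "canal": 5487}


def to_sequence(text, word_index):
    """Convert one text to a list of token ids, skipping unknown words."""
    return [word_index[w] for w in text.lower().split() if w in word_index]


print(to_sequence("a man a plan a canal panama", word_index))
# [2, 139, 2, 82, 2, 5487, 2931]

print(to_sequence("a man a plan a zeppelin", word_index))
# [2, 139, 2, 82, 2]  ("zeppelin" is not in this small index, so it is dropped)
```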

### Creating Sequences
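Now that the data is tokenized, we need to create sequences of tokens from the headlines. For each headline, we build every partial sequence of two or more tokens, always starting from the first word. These sequences will be used to train the deep learning model.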

In [22]:
# convert data to sequence of tokens
input_sequences = []
for line in all_headlines:
    # convert the headline into a sequence of tokens
    token_list = tokenizer.texts_to_sequences([line])[0]

    # create a series of sequences for each headline
    for i in range(1, len(token_list)):
        partial_sequence = token_list[:i+1]
        input_sequences.append(partial_sequence)

print(tokenizer.sequences_to_texts(input_sequences[:5]))
input_sequences[:5]

['n f', 'n f l', 'n f l vs', 'n f l vs politics', 'n f l vs politics has']


[[193, 125],
 [193, 125, 253],
 [193, 125, 253, 157],
 [193, 125, 253, 157, 226],
 [193, 125, 253, 157, 226, 83]]
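
The inner loop can be isolated into a small function to see exactly what it produces. This is a sketch with a made-up function name (prefix_sequences); the token values are taken from the output above.

```python
def prefix_sequences(token_list):
    """Return every prefix of token_list that contains at least two tokens."""
    return [token_list[: i + 1] for i in range(1, len(token_list))]


tokens = [193, 125, 253, 157]  # 'n f l vs' from the output above
for seq in prefix_sequences(tokens):
    print(seq)
# [193, 125]
# [193, 125, 253]
# [193, 125, 253, 157]
```

A headline with n tokens yields n - 1 training sequences, and a one-word headline yields none, because there is no preceding word to predict from.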

### Padding Sequences
These sequences have various lengths. For the model to be able to train on the data, all the sequences need to be the same length. We can do this by padding the shorter sequences with zeros using the Keras pad_sequences function.

In [23]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# determine max sequence length
max_sequence_len = max([len(x) for x in input_sequences])

# pad all sequences with zeros at the beginning to make them all max length
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
input_sequences[0]

array([  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
       193, 125], dtype=int32)
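
For intuition, pre-padding can be written in a few lines of plain Python. The function below (pad_pre, a hypothetical name) is a simplified sketch and not the Keras code: it returns nested lists rather than a NumPy array, and it only mimics the 'pre' behavior used above.

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad each sequence with `value` so that all have length `maxlen`."""
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    padded = []
    for s in sequences:
        s = list(s)[-maxlen:]  # if a sequence is too long, keep its last tokens
        padded.append([value] * (maxlen - len(s)) + s)
    return padded


print(pad_pre([[193, 125], [193, 125, 253], [193, 125, 253, 157]]))
# [[0, 0, 193, 125], [0, 193, 125, 253], [193, 125, 253, 157]]
```

Padding at the front keeps the most recent words at the end of every row, so the final position always holds the word the model should predict.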

### Creating Predictors and Target
We need to split each sequence into predictors and a target. The last word of the sequence is the target, and all the words before it are the predictors. For example, in "nvidia releases ampere graphics cards", the predictors are "nvidia releases ampere graphics" and the target/label is "cards".

In [26]:
# predictors are every word except the last
predictors = input_sequences[:,:-1]
# labels are the last word
labels = input_sequences[:,-1]
labels[:5]

array([125, 253, 157, 226,  83], dtype=int32)
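
The same split can be illustrated on a tiny example with plain lists. The padded rows below are shortened, hypothetical versions of the real data (4 columns instead of 28), and the list comprehensions stand in for the NumPy slicing used above.

```python
# Shortened, hypothetical padded sequences (the real rows have 28 columns).
padded = [
    [0, 0, 193, 125],
    [0, 193, 125, 253],
    [193, 125, 253, 157],
]

# Same idea as input_sequences[:, :-1] and input_sequences[:, -1].
example_predictors = [row[:-1] for row in padded]
example_labels = [row[-1] for row in padded]

print(example_predictors)  # [[0, 0, 193], [0, 193, 125], [193, 125, 253]]
print(example_labels)      # [125, 253, 157]
```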

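The targets are categorical: for each sequence, we are predicting one word out of the total vocabulary. Instead of having the model predict a single scalar token number, we one-hot encode the labels, so each label becomes a binary vector of length total_words with a 1 at the position of the target word and 0 everywhere else.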

In [27]:
from tensorflow.keras import utils

labels = utils.to_categorical(labels, num_classes=total_words)
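
To show what this encoding looks like, here is a plain-Python sketch of one-hot encoding on a toy vocabulary of 5 classes. The function name one_hot is made up for this example; it returns nested lists, whereas to_categorical returns a NumPy array.

```python
def one_hot(label_ids, num_classes):
    """Turn each integer label into a vector with a single 1.0 at that index."""
    rows = []
    for label in label_ids:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows


encoded = one_hot([1, 3], num_classes=5)
print(encoded)
# [[0.0, 1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0, 0.0]]
```

In the notebook, each encoded label has total_words entries, so every row is 11753 values long with exactly one of them set to 1.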