# News headline text generation model

Here, I create a model that generates news headlines using the Kaggle dataset of New York Times comments and headlines. 

## 1) Import the necessary libraries.


In [179]:
# MISC data science libraries
import datetime as dt
import pandas as pd
import numpy as np
import string, os
import requests # for accessing the NYT API
import pickle
from pickle import load


# keras module for preprocessing
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import keras.utils as ku 
# keras module for building LSTM 
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout
# keras module for LSTM training
from keras.models import load_model
from keras.callbacks import EarlyStopping, ModelCheckpoint


# set seeds for reproducibility
from tensorflow import set_random_seed
from numpy.random import seed
set_random_seed(2)
seed(1)


# import warnings
# warnings.filterwarnings("ignore")
# warnings.simplefilter(action='ignore', category=FutureWarning)

## 2) Load data

### 1. Load the [kaggle dataset](https://www.kaggle.com/aashita/nyt-comments) of New York Times comments and headlines
The NYT dataset downloaded from Kaggle contains eight months of data: January through April of both 2017 and 2018.

In [3]:
# explore the dataset using just one month of data

April2018_headlines = pd.read_csv('/Users/laurenfinkelstein/take_home_tests/SynapseFI/data/ArticlesApril2018.csv')
April2018_headlines.head()

Unnamed: 0,articleID,articleWordCount,byline,documentType,headline,keywords,multimedia,newDesk,printPage,pubDate,sectionName,snippet,source,typeOfMaterial,webURL
0,5adf6684068401528a2aa69b,781,By JOHN BRANCH,article,Former N.F.L. Cheerleaders’ Settlement Offer: ...,"['Workplace Hazards and Violations', 'Football...",68,Sports,0,2018-04-24 17:16:49,Pro Football,"“I understand that they could meet with us, pa...",The New York Times,News,https://www.nytimes.com/2018/04/24/sports/foot...
1,5adf653f068401528a2aa697,656,By LISA FRIEDMAN,article,E.P.A. to Unveil a New Rule. Its Effect: Less ...,"['Environmental Protection Agency', 'Pruitt, S...",68,Climate,0,2018-04-24 17:11:21,Unknown,The agency plans to publish a new regulation T...,The New York Times,News,https://www.nytimes.com/2018/04/24/climate/epa...
2,5adf4626068401528a2aa628,2427,By PETE WELLS,article,"The New Noma, Explained","['Restaurants', 'Noma (Copenhagen, Restaurant)...",66,Dining,0,2018-04-24 14:58:44,Unknown,What’s it like to eat at the second incarnatio...,The New York Times,News,https://www.nytimes.com/2018/04/24/dining/noma...
3,5adf40d2068401528a2aa619,626,By JULIE HIRSCHFELD DAVIS and PETER BAKER,article,Unknown,"['Macron, Emmanuel (1977- )', 'Trump, Donald J...",68,Washington,0,2018-04-24 14:35:57,Europe,President Trump welcomed President Emmanuel Ma...,The New York Times,News,https://www.nytimes.com/2018/04/24/world/europ...
4,5adf3d64068401528a2aa60f,815,By IAN AUSTEN and DAN BILEFSKY,article,Unknown,"['Toronto, Ontario, Attack (April, 2018)', 'Mu...",68,Foreign,0,2018-04-24 14:21:21,Canada,"Alek Minassian, 25, a resident of Toronto’s Ri...",The New York Times,News,https://www.nytimes.com/2018/04/24/world/canad...


In [55]:
# method from Kaggle kernel
# load the Kaggle article files (eight months are available)

curr_dir = '/Users/laurenfinkelstein/take_home_tests/SynapseFI/data/'
kaggle_headlines = []
for filename in os.listdir(curr_dir):
    if 'Articles' in filename:
        article_df = pd.read_csv(curr_dir + filename)
        kaggle_headlines.extend(list(article_df.headline.values))
        # NOTE: this break stops after the first matching file, so only one
        # month is loaded; remove it to load all eight months
        break

kaggle_headlines = [h for h in kaggle_headlines if h != "Unknown"]
len(kaggle_headlines)

831

In [5]:
# William method? using glob?
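
As a rough sketch of the glob-based approach hinted at in the note above, the helper below collects the `headline` column from every matching CSV in a directory. The function name `load_headlines` and the `Articles*.csv` pattern are assumptions made for illustration, and the sketch uses only the standard library rather than pandas.

```python
import csv
import glob
import os


def load_headlines(data_dir, pattern="Articles*.csv"):
    """Collect the `headline` column from every CSV in data_dir matching pattern."""
    headlines = []
    # sorted() makes the file order deterministic across platforms
    for path in sorted(glob.glob(os.path.join(data_dir, pattern))):
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                headlines.append(row["headline"])
    # drop the placeholder the dataset uses for missing headlines
    return [h for h in headlines if h != "Unknown"]
```

Because the pattern only matches file names starting with `Articles`, comment files stored in the same directory are skipped.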

### 2. Augment the data obtained from Kaggle with more data using the New York Times API

After first running the model on the Kaggle data alone, I found the results unsatisfactory. This may be because the Kaggle data loaded above amounts to only 831 headlines (the dataset covers January through April of 2017 and 2018). To improve the model, I've augmented the dataset with headlines from January 2019 through April 2019, obtained directly from the New York Times Archive API.

In [56]:
def get_NYT_headlines(year, month):
    
    '''
     This function makes a request to the New York Times Archive API and collects 
     a list of article headlines for the specified month and year of interest.
    '''

    api_key = {'api-key' : '6dZ0JG8qQwUG4Wfs6JuukwOhw2SJxZ1f'}
    url = 'https://api.nytimes.com/svc/archive/v1/' + str(year) + '/' + str(month) + '.json'
    
    response = requests.get(url, params=api_key)
    output = response.json()
    
    docs = output['response']['docs']
    
    headlines = []
    for doc in docs:
        headlines.append(doc['headline']['main'])
    
    return headlines
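
One possible variation on the function above is to keep the API key out of the notebook by reading it from an environment variable. The sketch below only builds the URL and query parameters, so it can be checked without a network call. The helper name `build_archive_request` and the variable name `NYT_API_KEY` are hypothetical choices for this example.

```python
import os


def build_archive_request(year, month, env_var="NYT_API_KEY"):
    """Return the (url, params) pair for one month of the NYT Archive API."""
    # env_var is a hypothetical variable name chosen for this sketch
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    url = f"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json"
    return url, {"api-key": api_key}
```

The returned pair can be passed straight to `requests.get(url, params=params)`, which is the same call the function above makes.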

In [57]:
# testing the above function using the month December, 2018

dec_2018_headlines = get_NYT_headlines(2018, 12)
dec_2018_headlines[:10]

['The People ‘Are Hungry’: Scenes from the ‘Yellow Vest’ Protests in Paris',
 'The Marvelous Mr. Mackie',
 'Brazil’s New Leader Wants to Ease Gun Laws. Supporters Are Ready, and Training.',
 'George H.W. Bush, Public Servant',
 'Politicians and Family React to George Bush’s Death',
 'How George Bush Befriended Dana Carvey, the ‘S.N.L.’ Comedian Who Impersonated Him',
 'Quotation of the Day: A Peaceful Exit, but First, One Last ‘I Love You, Too’',
 'The Most Wonderful Smelling Time of the Year',
 'Michelle Bachelet: Ignore Climate Change at Your Peril',
 'N.Y. Today: Protecting Rent-Stabilized Tenants From Shady Landlords']

In [60]:
def date_range_len(start_yr, start_mth, end_yr, end_mth):
    months = (end_yr - start_yr)*12 + (end_mth - start_mth) + 1
    return months

In [62]:
def get_date_df(months, start_yr, start_mth):
    yr_list = []
    mth_list = []
    date_list = []
    curr_yr = start_yr
    curr_mth = start_mth
    for i in range(months):
        yr_list.append(curr_yr)
        mth_list.append(curr_mth)
        date_list.append(dt.date(curr_yr, curr_mth, 1))
        if curr_mth < 12:
            curr_mth += 1
        else:
            curr_mth = 1
            curr_yr += 1
    date_df = pd.DataFrame(yr_list, columns=['year'])
    date_df['month'] = mth_list
    date_df['date'] = date_list
    return date_df

In [61]:
# download data from January 2019 through April 2019

start_yr = 2019
start_mth = 1
end_yr = 2019
end_mth = 4

months = date_range_len(start_yr, start_mth, end_yr, end_mth)
months

4

In [63]:
date_df = get_date_df(months, start_yr, start_mth)
date_df.head()

Unnamed: 0,year,month,date
0,2019,1,2019-01-01
1,2019,2,2019-02-01
2,2019,3,2019-03-01
3,2019,4,2019-04-01


In [94]:
NYT_headlines = []
for yr, mth in zip(date_df['year'], date_df['month']):
    print(yr, mth)
    NYT_headlines.extend(get_NYT_headlines(yr, mth))

2019 1
2019 2
2019 3
2019 4


In [95]:
len(NYT_headlines) 

# that's better; now we have a lot more data to train our model with 
# and can append this to the data downloaded from Kaggle

27822

In [97]:
NYT_headlines[:5]

['Daryl Dragon, of the Captain and Tennille Pop Duo, Dies at 76',
 'Where Doulas Calm Nerves and Bridge Cultures During Childbirth',
 'Voting Issues and Gerrymanders Are Now Key Political Battlegrounds',
 'Protecting Pregnant Workers',
 'When Louis C.K. Crossed a Line']

In [100]:
# there are a lot more headlines from the four months of data downloaded using the NYT API
# than there are in the dataset downloaded from Kaggle.
# Are there duplicates?

len(NYT_headlines) == len(set(NYT_headlines))
# number of duplicates: len(NYT_headlines) - len(set(NYT_headlines))

False

In [102]:
# looks like there are many (12,750) duplicate headlines in NYT_headlines; let's get rid of those

seen = set()
NYT_headlines_uniq = []
for headline in NYT_headlines:
    if headline not in seen:
        NYT_headlines_uniq.append(headline)
        seen.add(headline)

In [103]:
len(NYT_headlines_uniq) == len(set(NYT_headlines_uniq))

True
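
The seen-set loop above can also be written more compactly. The sketch below is an equivalent order-preserving deduplication; the helper name `dedupe_keep_order` is made up for this example, and it relies on dictionaries keeping insertion order (Python 3.7+).

```python
def dedupe_keep_order(items):
    """Remove duplicates while keeping the first occurrence of each item."""
    # dicts preserve insertion order (Python 3.7+), so the keys come back
    # in the order they were first seen
    return list(dict.fromkeys(items))


sample = ["b", "a", "b", "c", "a"]
print(dedupe_keep_order(sample))  # ['b', 'a', 'c']
```

Unlike `list(set(items))`, this keeps the headlines in their original order.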

In [104]:
# Looks like there are also some duplicates in the Kaggle dataset, but only 2

len(kaggle_headlines) == len(set(kaggle_headlines))
len(kaggle_headlines) - len(set(kaggle_headlines))

2

### 3. Merge Kaggle data with NYT Archive data

In [108]:
# merge Kaggle data with NYT Archive data

all_headlines = kaggle_headlines + NYT_headlines_uniq
len(all_headlines)

15903

Now that we've merged the headlines downloaded from Kaggle with those downloaded using the NYT API, we need to check for duplicates and remove them from the list of headlines we will use to train the model.

In [109]:
# there are 3 duplicates in our all_headlines list

len(all_headlines) == len(set(all_headlines))
len(all_headlines) - len(set(all_headlines))

3

In [110]:
# let's get rid of the duplicates in all_headlines

seen2 = set()
all_headlines_uniq = []
for headline in all_headlines:
    if headline not in seen2:
        all_headlines_uniq.append(headline)
        seen2.add(headline)

In [111]:
# great, no more duplicate headlines

len(all_headlines_uniq)

15900

It looks like some of the headlines in this list are simply empty strings. We will remove these from the list. 

In [114]:
# looks like there is an empty string in all_headlines_uniq

'' in all_headlines_uniq

True

In [115]:
# let's get rid of that empty string

all_headlines_uniq.remove('')

In [116]:
'' in all_headlines_uniq

False

In [150]:
# looks like there are null values too

None in all_headlines_uniq

True

In [151]:
# remove null values

all_headlines_uniq.remove(None)

In [152]:
None in all_headlines_uniq

False

In [153]:
len(all_headlines_uniq)

15898
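
The two cleanup steps above (dropping the empty string and the null value) could be handled in a single pass. A minimal sketch, assuming that anything that is not a non-blank string should be discarded; the helper name `drop_missing` is hypothetical.

```python
def drop_missing(headlines):
    """Keep only headlines that are non-empty, non-whitespace strings."""
    return [h for h in headlines if isinstance(h, str) and h.strip()]


sample = ["The New Noma, Explained", "", None, "   ", "Protecting Pregnant Workers"]
print(drop_missing(sample))
```

Note that this is slightly stricter than the cells above, because it also removes whitespace-only strings.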

## 3) Preprocess the data

### 1. Clean and tokenize the data

Preprocessing consists of cleaning and tokenizing the data. To clean the data, we remove punctuation and lowercase all words in the corpus. Unlike in many NLP projects, we do not remove stop words, because we want the model to generate fluent headlines similar to those a human would write.

We clean and tokenize the corpus in a single step using the Keras Tokenizer class. With its default settings, the tokenizer strips most punctuation as well as tabs and line breaks, converts text to lowercase, and vectorizes the corpus by turning each text into a sequence of integers.

Tokenization is necessary to prepare the data for the embedding layer (see the model architecture section below).
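
To make the tokenization step concrete, here is a simplified pure-Python sketch of the idea: lowercase the text, replace punctuation with spaces, and assign each word an integer starting at 1, with more frequent words getting smaller indices. It is an illustration only, not the Keras implementation; the function names are made up, and the filter string is an assumption meant to approximate the Keras default.

```python
from collections import Counter

# Characters replaced by spaces before splitting (assumed to approximate
# the Keras default filter list)
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_tokenize(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    text = text.lower().translate(str.maketrans(FILTERS, " " * len(FILTERS)))
    return text.split()


def build_word_index(corpus):
    """Map each word to an integer, starting at 1, ordered by frequency."""
    counts = Counter(w for text in corpus for w in simple_tokenize(text))
    # most_common() sorts by count; ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(corpus, word_index):
    """Replace every known word with its integer index."""
    return [[word_index[w] for w in simple_tokenize(t) if w in word_index]
            for t in corpus]


demo = ["The cat, the hat.", "A cat!"]
index = build_word_index(demo)
print(index)                            # {'the': 1, 'cat': 2, 'hat': 3, 'a': 4}
print(texts_to_sequences(demo, index))  # [[1, 2, 1, 3], [4, 2]]
print(len(index) + 1)                   # 5: index 0 is left free for padding
```

Because indices start at 1, the vocabulary size passed to the embedding and output layers is the number of distinct words plus one.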

In [164]:
corpus = all_headlines_uniq

# Clean and tokenize using Keras' built-in Tokenizer class
tokenizer = Tokenizer() # create the tokenizer
tokenizer.fit_on_texts(corpus) # fit the tokenizer on the documents

# Convert tokenized texts to sequence format
sequences = tokenizer.texts_to_sequences(corpus)
print(sequences[:5])
# NOTE: this sums len() over the raw headline strings, so it counts characters;
# use sum(len(s) for s in sequences) to count tokens instead
nb_samples = sum(len(s) for s in corpus)
print(nb_samples)

# Vocab size (from Metis Deep Learning): word_index maps each word to a unique
# integer starting at 1; we add 1 because index 0 is reserved for padding
V = len(tokenizer.word_index) + 1
print(V)

# Dimension to reduce to
dim = 100
window_size = 2
# sequences

[[590, 21, 4191, 675, 5, 2, 2704, 163, 4, 3059], [6, 61, 1, 4192, 13, 4193], [613, 9767, 73, 4194], [6588, 9768, 769, 1569, 925], [614, 377, 7, 2435]]
869840
20387


In [165]:
## convert data to sequences of tokens

# n-gram sequences from all documents in the corpus
input_sequences = []

for headline in corpus:
    token_list = tokenizer.texts_to_sequences([headline])[0] # token indices for each word in the headline
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1] # prefix n-grams, from the first 2 tokens up to the full headline
        input_sequences.append(n_gram_sequence)

input_sequences[:5], V

([[590, 21],
  [590, 21, 4191],
  [590, 21, 4191, 675],
  [590, 21, 4191, 675, 5],
  [590, 21, 4191, 675, 5, 2]],
 20387)

### Clean the data

- remove punctuation
- lowercase all words

### 2. Generate sequence of n-gram tokens

Language modelling takes in sequential data (i.e., words/tokens) and uses it to predict the next word/token. This requires tokenization, the process of extracting tokens (i.e., terms/words) from a corpus. To do this, I use the Keras Tokenizer, which obtains the tokens and their respective indices in the vocabulary. This allows every text document in the dataset to be converted into a sequence of tokens.

Each list in the output above is an n-gram phrase generated from the input data (i.e., the documents/headlines). Each integer in these n-grams is the index of the given word in the complete vocabulary of the corpus.

For example, the first headline in the corpus is "Finding an expansive view of a forgotten people in niger". For this headline, we see the following output, i.e. **sequences of tokens**:

    [590, 21],
    [590, 21, 4191],
    [590, 21, 4191, 675],
    [590, 21, 4191, 675, 5],
    [590, 21, 4191, 675, 5, 2],
    [590, 21, 4191, 675, 5, 2, 2704],
    [590, 21, 4191, 675, 5, 2, 2704, 163],
    [590, 21, 4191, 675, 5, 2, 2704, 163, 4],
    [590, 21, 4191, 675, 5, 2, 2704, 163, 4, 3059]

which, respectively, equate to the following **n-grams**: 

    Finding an,
    Finding an expansive,
    Finding an expansive view,
    Finding an expansive view of,
    Finding an expansive view of a,
    Finding an expansive view of a forgotten,
    Finding an expansive view of a forgotten people,
    Finding an expansive view of a forgotten people in,
    Finding an expansive view of a forgotten people in niger
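
The prefix construction illustrated above can be sketched as a small standalone function. The name `prefix_ngrams` is made up for this example; the slicing mirrors the `token_list[:i+1]` loop in the cell above.

```python
def prefix_ngrams(token_list):
    """Return every prefix of token_list that has at least two tokens."""
    return [token_list[: i + 1] for i in range(1, len(token_list))]


for ngram in prefix_ngrams([590, 21, 4191, 675]):
    print(ngram)
# [590, 21]
# [590, 21, 4191]
# [590, 21, 4191, 675]
```

A headline with n tokens therefore contributes n - 1 training sequences, and a one-word headline contributes none.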

### 3. Pad the sequences and obtain variables (predictor and target)

Up to this point, we've generated a dataset of token sequences. These sequences have different lengths, so before training the model we must make all sequence lengths equal. For this, we pad the sequences using the Keras pad_sequences function.

In [166]:
## pad the sequences (code adapted from the Kaggle kernel)

# find the length of the longest input sequence
max_sequence_len = max(len(x) for x in input_sequences)
# pre-pad every sequence with zeros to that length; the result is a numpy array, which the model requires
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
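
For intuition, pre-padding can be sketched in plain Python as follows. This is a simplified stand-in for what the padding step does with `padding='pre'`, not the Keras code itself, and the helper name `pad_pre` is hypothetical.

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad each sequence with `value` so all have length maxlen."""
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    padded = []
    for s in sequences:
        s = list(s)[-maxlen:]  # if a sequence is too long, keep its last maxlen tokens
        padded.append([value] * (maxlen - len(s)) + s)
    return padded


print(pad_pre([[590, 21], [590, 21, 4191], [590, 21, 4191, 675]]))
# [[0, 0, 590, 21], [0, 590, 21, 4191], [590, 21, 4191, 675]]
```

Padding on the left keeps the most recent words, and in particular the final word used as the label, aligned at the end of every row.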

Additionally, in order to train the model, we must create predictors and a label. We use the leading words of each n-gram sequence as the predictors, and the word that follows them as the label.

For example, for the headline "Finding an expansive view of a forgotten people in niger", the **predictors** will be:

    1. Finding an
    2. Finding an expansive
    3. Finding an expansive view

and the corresponding **labels** will be

    1. expansive
    2. view
    3. of

... and so on. 

In [176]:
## create predictors and label
predictors, label = input_sequences[:,:-1], input_sequences[:,-1] # label is the last word of each n-gram, and predictors are all the preceding words
label = ku.to_categorical(label, num_classes=V) # one-hot encodes the labels; the result has one row per input sequence and V columns (one per vocabulary entry, plus the padding index)

Now we have the input matrix *X* (predictors) and the one-hot label matrix _Y_ (label), which will be used to train the network.
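
The split into predictors and one-hot labels can be illustrated on a tiny padded example. The sketch below uses plain lists and hypothetical helper names; it mirrors the slicing and one-hot encoding described above rather than reproducing the notebook's arrays.

```python
def split_predictors_label(padded):
    """Split each padded row into (all but the last token, the last token)."""
    predictors = [row[:-1] for row in padded]
    labels = [row[-1] for row in padded]
    return predictors, labels


def one_hot(index, num_classes):
    """Return a list of zeros with a single 1 at position `index`."""
    vec = [0] * num_classes
    vec[index] = 1
    return vec


padded = [[0, 0, 1, 2], [0, 1, 2, 3], [1, 2, 3, 4]]
X, y = split_predictors_label(padded)
Y = [one_hot(label, num_classes=5) for label in y]
print(X)  # [[0, 0, 1], [0, 1, 2], [1, 2, 3]]
print(y)  # [2, 3, 4]
print(Y)  # [[0, 0, 1, 0, 0], [0, 0, 0, 1, 0], [0, 0, 0, 0, 1]]
```

With a vocabulary of tens of thousands of words the one-hot matrix becomes very wide, which is worth keeping in mind for memory use.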

## 4) Build and train the model

We will use an LSTM model with the following layers:
    
***Input Layer*** : Takes the sequence of words as input and maps each word index to an embedding vector
<br /> ***LSTM Layer*** : Computes the output using LSTM units. There are currently 100 units, but this can be fine-tuned later.
<br /> ***Dropout Layer*** : A regularization layer that randomly turns off the activations of some neurons in the LSTM layer in order to prevent overfitting. (Note: this is an optional layer)
<br /> ***Output Layer*** : Computes a probability for each word in the vocabulary being the next word
    
We will run this model for 100 epochs, but this can be further experimented with and fine-tuned.
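
To get a feel for the size of this architecture, the trainable parameters can be counted by hand. The sketch below assumes the standard parameter formulas for an embedding layer, a single LSTM layer with one bias vector per gate, and a dense softmax layer, using the embedding size of 10 and the 100 LSTM units from the model code. The function name is made up for this example.

```python
def count_parameters(vocab_size, embed_dim=10, lstm_units=100):
    """Estimate trainable parameters for Embedding -> LSTM -> Dense(softmax)."""
    embedding = vocab_size * embed_dim
    # 4 gates, each with input weights, recurrent weights, and a bias
    lstm = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
    # one weight per (LSTM unit, word) pair plus one bias per word
    dense = lstm_units * vocab_size + vocab_size
    return {"embedding": embedding, "lstm": lstm, "dense": dense,
            "total": embedding + lstm + dense}


print(count_parameters(20387))
# {'embedding': 203870, 'lstm': 44400, 'dense': 2059087, 'total': 2307357}
```

Under these assumptions almost all parameters sit in the output layer, because it needs one column per word in the vocabulary.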

In [None]:
## create the model

def create_model(max_sequence_len, total_words):
    input_len = max_sequence_len - 1 # - 1 because the last word in each sequence is used as the label
    
    ## Create a Keras Sequential model 
    model = Sequential()
    
    ## Specify the network architecture
    # Add Input Embedding Layer
    model.add(Embedding(total_words, 10, input_length=input_len))
    
    # Add Hidden Layer 1 - LSTM Layer
    model.add(LSTM(100))
    model.add(Dropout(0.1))
    
    # Add Output Layer
    model.add(Dense(total_words, activation='softmax'))

    # Compile the network (sets up training parameters before training)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    
    return model

model = create_model(max_sequence_len, V) # V is the vocabulary size computed during tokenization
model.summary()

In [None]:
# View the network graphically
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot, plot_model

SVG(model_to_dot(model, show_shapes=True).create(prog='dot', format='svg'))

Now that we've created the model architecture, we can train it using our data (predictors and labels). 

In [None]:
## train the model

model.fit(predictors, label, epochs=100, verbose=2) # verbose=2 prints one line per epoch

## 5) Use the model to generate text

Now that the model has been trained, we can use it to predict the next word based on input words (i.e., seed text). 

To feed this seed text into the model, we will need to do some pre-processing, which includes (1) tokenizing the seed text and (2) padding the sequences. We can then pass the sequences into the trained model to get the predicted word. The multiple predicted words can then be appended together to obtain a predicted sequence. 

In [None]:
def generate_text(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0] # token indices for each word in the seed text
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre') # add padding
        # index of the most probable next word (predict_classes is unavailable in recent Keras releases)
        predicted = int(np.argmax(model.predict(token_list, verbose=0), axis=-1)[0])
        
        output_word = ""
        for word, index in tokenizer.word_index.items(): # word_index maps words to their uniquely assigned integers
            if index == predicted:
                output_word = word
                break
        seed_text += " " + output_word
    return seed_text.title()
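
The generation loop can be tried out without a trained network. In the sketch below, `toy_predict` is a made-up stand-in for the model, and a reverse dictionary replaces the linear scan over the word index. It illustrates the greedy word-by-word loop only, under those assumptions.

```python
def greedy_generate(seed_tokens, n_words, predict_next, index_to_word):
    """Repeatedly predict the next token and append it to the sequence."""
    tokens = list(seed_tokens)
    words = []
    for _ in range(n_words):
        next_index = predict_next(tokens)
        words.append(index_to_word.get(next_index, ""))
        tokens.append(next_index)
    return words


word_index = {"new": 1, "york": 2, "today": 3}
# reverse lookup: integer index -> word
index_to_word = {i: w for w, i in word_index.items()}


def toy_predict(tokens):
    # stand-in for the trained model: cycles through the three indices
    return tokens[-1] % 3 + 1


print(greedy_generate([1, 2], 3, toy_predict, index_to_word))
# ['today', 'new', 'york']
```

Building the reverse dictionary once avoids scanning the whole vocabulary for every generated word.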

Here are some results:

In [None]:
print(generate_text("united states", 5, model, max_sequence_len))
print(generate_text("president trump", 4, model, max_sequence_len))
print(generate_text("donald trump", 4, model, max_sequence_len))
print(generate_text("india and china", 4, model, max_sequence_len))
print(generate_text("new york", 4, model, max_sequence_len))
print(generate_text("science and technology", 5, model, max_sequence_len))

In [None]:
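# Evaluate the model
# NOTE: X_test and y_test are not defined anywhere above; a held-out set must be
# split off and preprocessed the same way as predictors/label before this cell can run

loss_and_metrics = model.evaluate(X_test, y_test, batch_size=32)
print('\n', loss_and_metrics)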
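
One way to obtain a held-out set for the evaluation cell above is to set aside a fraction of the headlines before building the n-gram sequences, so that no test headline contributes training n-grams. The sketch below is one possible approach and not part of the original workflow; the function name, the 20% fraction, and the seed are arbitrary choices.

```python
import random


def split_headlines(headlines, test_fraction=0.2, seed=1):
    """Randomly split headlines into (train, test) lists, preserving order."""
    indices = list(range(len(headlines)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(headlines) * test_fraction)
    test_idx = set(indices[:n_test])
    train = [h for i, h in enumerate(headlines) if i not in test_idx]
    test = [h for i, h in enumerate(headlines) if i in test_idx]
    return train, test


train, test = split_headlines([f"headline {i}" for i in range(10)])
print(len(train), len(test))  # 8 2
```

The test headlines would then go through the same tokenizer, n-gram, padding, and one-hot steps to produce the evaluation inputs and labels.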

## 6) Improvement ideas
The model's output looks reasonable, but the results can be improved further by:

- Adding more data
- Fine-tuning the network architecture
- Fine-tuning the network parameters