# Headlines Generator

We will implement a text-prediction model and use it to generate headlines.

### Objectives

* We will prepare the sequence data for use in an LSTM (a special type of RNN)
* We will build and train a model to perform word prediction

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import os
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import utils
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.models import Sequential

In [2]:
articles_dir = 'Articles/'

all_headlines = []

for fn in os.listdir(articles_dir):
    if 'Articles' in fn:
        df = pd.read_csv(os.path.join(articles_dir, fn))  # reading the headlines from each CSV file
        all_headlines.extend(list(df.headline.values))

In [3]:
print(len(all_headlines))

9335


In [4]:
all_headlines[:20]

['Finding an Expansive View  of a Forgotten People in Niger',
 'And Now,  the Dreaded Trump Curse',
 'Venezuela’s Descent Into Dictatorship',
 'Stain Permeates Basketball Blue Blood',
 'Taking Things for Granted',
 'The Caged Beast Awakens',
 'An Ever-Unfolding Story',
 'O’Reilly Thrives as Settlements Add Up',
 'Mouse Infestation',
 'Divide in G.O.P. Now Threatens Trump Tax Plan',
 'Variety Puzzle: Acrostic',
 'They Can Hit a Ball 400 Feet. But Play Catch? That’s Tricky.',
 'In Trump Country, Shock at Trump Budget Cuts',
 'Why Is This Hate Different From All Other Hate?',
 'Pick Your Favorite Ethical Offender',
 'My Son’s Growing Black Pride',
 'Jerks and the Start-Ups They Ruin',
 'Trump  Needs  a Brain',
 'Manhood in the Age of Trump',
 'The Value of a Black College']

### Data cleaning

In [5]:
# Counting the headlines labeled as "Unknown"
unknown_count = 0
for line in all_headlines:
    if line=='Unknown':
        unknown_count += 1
print(unknown_count)

732


In [6]:
#Removing the "Unknown" headlines
all_headlines = [ line for line in all_headlines if line != "Unknown" ]

In [7]:
print(len(all_headlines))  #9335 - 732 = 8603

8603


In [8]:
all_headlines[:20]

['Finding an Expansive View  of a Forgotten People in Niger',
 'And Now,  the Dreaded Trump Curse',
 'Venezuela’s Descent Into Dictatorship',
 'Stain Permeates Basketball Blue Blood',
 'Taking Things for Granted',
 'The Caged Beast Awakens',
 'An Ever-Unfolding Story',
 'O’Reilly Thrives as Settlements Add Up',
 'Mouse Infestation',
 'Divide in G.O.P. Now Threatens Trump Tax Plan',
 'Variety Puzzle: Acrostic',
 'They Can Hit a Ball 400 Feet. But Play Catch? That’s Tricky.',
 'In Trump Country, Shock at Trump Budget Cuts',
 'Why Is This Hate Different From All Other Hate?',
 'Pick Your Favorite Ethical Offender',
 'My Son’s Growing Black Pride',
 'Jerks and the Start-Ups They Ruin',
 'Trump  Needs  a Brain',
 'Manhood in the Age of Trump',
 'The Value of a Black College']

In [9]:
# Tokenizing the words in all the headlines.
# By default the Tokenizer class lowercases the text and strips ASCII punctuation.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_headlines)
total_num_of_words = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved for padding

In [10]:
print("Total number of words: ", total_num_of_words)

Total number of words:  11753


In [11]:
tokenizer.word_index

{'the': 1,
 'a': 2,
 'to': 3,
 'of': 4,
 'in': 5,
 'for': 6,
 'and': 7,
 'on': 8,
 'is': 9,
 'trump': 10,
 'with': 11,
 'new': 12,
 'at': 13,
 'how': 14,
 'what': 15,
 'you': 16,
 'an': 17,
 'from': 18,
 'as': 19,
 'it': 20,
 'trump’s': 21,
 'your': 22,
 'are': 23,
 'not': 24,
 'be': 25,
 'season': 26,
 's': 27,
 'u': 28,
 'that': 29,
 'i': 30,
 'by': 31,
 'about': 32,
 'but': 33,
 'episode': 34,
 'can': 35,
 'do': 36,
 'up': 37,
 'when': 38,
 'york': 39,
 'over': 40,
 'this': 41,
 'out': 42,
 'no': 43,
 '’': 44,
 'why': 45,
 'more': 46,
 'p': 47,
 '‘the': 48,
 'after': 49,
 'o': 50,
 'will': 51,
 'my': 52,
 'may': 53,
 'it’s': 54,
 'or': 55,
 'health': 56,
 'war': 57,
 'who': 58,
 'his': 59,
 'we': 60,
 'its': 61,
 'teaching': 62,
 'questions': 63,
 'g': 64,
 'president': 65,
 'was': 66,
 'house': 67,
 'one': 68,
 'have': 69,
 '1': 70,
 'should': 71,
 'get': 72,
 'today': 73,
 'into': 74,
 'all': 75,
 'now': 76,
 '2': 77,
 'life': 78,
 'home': 79,
 'our': 80,
 'don’t': 81,
 'plan': 82

In [12]:
# Creating a smaller dictionary to visualize tokenization
small_dictionary = {key: value for key, value in tokenizer.word_index.items()
                    if key in ['the', 'plan', 'is', 'to', 'play', 'eat', 'sleep', 'repeat']}

In [13]:
print(small_dictionary)

{'the': 1, 'to': 3, 'is': 9, 'plan': 82, 'eat': 247, 'play': 330, 'sleep': 787, 'repeat': 3226}


In [14]:
tokenizer.texts_to_sequences(['the','plan', 'is', 'to', 'play', 'eat','sleep', 'repeat'])

[[1], [82], [9], [3], [330], [247], [787], [3226]]
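
To make the indexing concrete, here is a minimal pure-Python sketch of what the tokenizer appears to do: lowercase the text, replace the default ASCII punctuation with spaces, and rank words by frequency starting at 1. The names `build_word_index` and `toy_headlines` are illustrative and not part of Keras. Because the default filter covers ASCII characters only, typographic quotes such as `’` survive, which would explain entries like `'trump’s'` and `'’'` in the word index above.

```python
from collections import Counter

# Default punctuation filter of the Keras Tokenizer (ASCII characters only).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def build_word_index(texts):
    """Toy re-implementation: lowercase, strip punctuation, rank by frequency."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(table).split())
    # The most frequent word gets index 1; index 0 stays free for padding.
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

toy_headlines = ["The Plan Is to Play", "The Plan Is to Sleep", "The Plan, Again"]
toy_index = build_word_index(toy_headlines)
toy_vocab_size = len(toy_index) + 1  # mirrors total_num_of_words
print(toy_index, toy_vocab_size)
```

The extra `+ 1` accounts for index 0, which is never assigned to a word and is later used as the padding value.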

### Creating sequence of tokens for training

In [15]:
# Our model works on sequences of integer tokens rather than on the words themselves

all_sequences = []

for line in all_headlines:
    sequence_of_tokens = tokenizer.texts_to_sequences( [line] )[0] #converting the headline into a sequence of tokens
    
    for i in range(1, len(sequence_of_tokens)):
        partial_sequence = sequence_of_tokens[:i+1]
        all_sequences.append(partial_sequence)

In [16]:
print(tokenizer.sequences_to_texts(all_sequences[:5]))

['finding an', 'finding an expansive', 'finding an expansive view', 'finding an expansive view of', 'finding an expansive view of a']


In [17]:
all_sequences[:5]

[[403, 17],
 [403, 17, 5242],
 [403, 17, 5242, 543],
 [403, 17, 5242, 543, 4],
 [403, 17, 5242, 543, 4, 2]]
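
The loop above turns each headline into all of its prefixes of length two or more. A small stand-alone sketch of the same idea follows; `prefix_sequences` is a hypothetical helper name, and the token ids are taken from the output above.

```python
def prefix_sequences(tokens):
    """Return every prefix of `tokens` that has at least two elements."""
    return [tokens[:end] for end in range(2, len(tokens) + 1)]

toy_tokens = [403, 17, 5242, 543]  # "finding an expansive view"
toy_prefixes = prefix_sequences(toy_tokens)
print(toy_prefixes)

# A headline with n tokens yields n - 1 training sequences,
# so a one-word headline contributes nothing.
print(prefix_sequences([403]))
```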

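### Padding our sequences

We pad all sequences to the same length so that they can be stacked into a single array for training. Padding is applied at the front (`padding = 'pre'`), which keeps the label as the last token of every row.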

In [18]:
max_seq_len = max([len(line) for line in all_sequences])

In [19]:
all_sequences = np.array(pad_sequences(all_sequences, maxlen = max_seq_len, padding = 'pre'))
all_sequences[0]

array([  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
       403,  17])

In [20]:
max_seq_len

28
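
As a rough illustration of pre-padding, the following pure-Python sketch left-pads a few token sequences with zeros and then splits each row into predictors and label. `pad_pre` is an assumed helper written for this example; it only imitates the behaviour of `pad_sequences` with `padding='pre'` and is not the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value`; keep the last `maxlen` tokens if too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

toy_sequences = [[403, 17], [403, 17, 5242], [403, 17, 5242, 543]]
toy_padded = pad_pre(toy_sequences, maxlen=4)

# With padding at the front, the label always sits in the final column.
toy_predictors = [row[:-1] for row in toy_padded]
toy_labels = [row[-1] for row in toy_padded]
print(toy_padded)
print(toy_labels)
```

Padding at the end instead would leave zeros in the last column for short sequences, so the last-column slice would no longer pick out the word to predict.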

In [21]:
# We will use the predictors to predict the target words (the labels, i.e. the last word in each sequence)

# Predictors are every word except the last
predictors = all_sequences[:,:-1]
# Labels are the last word
labels = all_sequences[:,-1]
labels[:5]

array([  17, 5242,  543,    4,    2])

In [22]:
labels = utils.to_categorical(labels, num_classes=total_num_of_words)

In [23]:
labels

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
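
The printed matrix shows only zeros because each row contains a single 1.0 among 11753 columns, and the ellipsis hides it. A toy sketch of the encoding, with a hypothetical `one_hot` helper and four classes instead of the full vocabulary:

```python
def one_hot(label, num_classes):
    """Return a list of zeros with a single 1.0 at position `label`."""
    row = [0.0] * num_classes
    row[label] = 1.0
    return row

toy_labels = [2, 0, 3]
toy_one_hot = [one_hot(label, num_classes=4) for label in toy_labels]
for row in toy_one_hot:
    print(row)
```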

In [24]:
# Our input length is max_seq_len - 1, as the last word is the label
input_len = max_seq_len - 1

model = Sequential()
# Adding the input embedding layer with embedding dimension 10
model.add(Embedding(total_num_of_words, 10, input_length=input_len))
# Adding an LSTM layer with 100 units
model.add(LSTM(100))
model.add(Dropout(0.1))
# Adding the output layer: one softmax probability per word in the vocabulary
model.add(Dense(total_num_of_words, activation='softmax'))

In [25]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 27, 10)            117530    
_________________________________________________________________
lstm (LSTM)                  (None, 100)               44400     
_________________________________________________________________
dropout (Dropout)            (None, 100)               0         
_________________________________________________________________
dense (Dense)                (None, 11753)             1187053   
Total params: 1,348,983
Trainable params: 1,348,983
Non-trainable params: 0
_________________________________________________________________
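
The parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic using the standard LSTM layout of four weight blocks (input, forget and output gates plus the candidate cell state), each with input weights, recurrent weights and a bias; the variable names are chosen for this example only.

```python
vocab_size = 11753  # total_num_of_words
embed_dim = 10
lstm_units = 100

# One embedding vector of length embed_dim per vocabulary entry.
embedding_params = vocab_size * embed_dim
# Four weight blocks, each with (input + recurrent + bias) weights per unit.
lstm_params = 4 * (embed_dim + lstm_units + 1) * lstm_units
# Dense output layer: weight matrix plus one bias per output word.
dense_params = lstm_units * vocab_size + vocab_size
total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Most of the parameters sit in the output layer, since it needs one weight vector per word in the vocabulary.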


In [26]:
model.compile(loss='categorical_crossentropy', optimizer='adam')
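
For intuition about the loss: with a one-hot target, categorical cross-entropy reduces to the negative log of the probability the model assigns to the correct word. The sketch below is a simplified version written for illustration (it omits the numerical clipping a real implementation would apply), and the toy probabilities are made up.

```python
import math

def categorical_crossentropy(one_hot_target, predicted_probs):
    """-sum(t * log(p)); with a one-hot target this is -log(p[true class])."""
    return -sum(t * math.log(p)
                for t, p in zip(one_hot_target, predicted_probs) if t > 0)

toy_target = [0.0, 0.0, 1.0, 0.0]
toy_probs = [0.1, 0.2, 0.6, 0.1]
toy_loss = categorical_crossentropy(toy_target, toy_probs)

# Reference point: a model that spreads probability uniformly over
# the 11753-word vocabulary would score log(11753) per example.
uniform_loss = math.log(11753)
print(toy_loss, uniform_loss)
```

A training loss well below the uniform baseline indicates the model has learned something about word order, although on its own it says nothing about how well the model generalizes.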

In [27]:
model.fit(predictors, labels, epochs=30, verbose=1)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<tensorflow.python.keras.callbacks.History at 0x1df85f707b8>

### Predicting the next word

In [28]:
def predict_next_token(texts):
    token_sequence = tokenizer.texts_to_sequences([texts])[0]
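    token_sequence = pad_sequences([token_sequence], maxlen=max_seq_len - 1, padding='pre')
    # predict_classes was removed in TensorFlow 2.6; take the argmax of the softmax output instead
    prediction = np.argmax(model.predict(token_sequence, verbose=0), axis=-1)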
    return prediction

In [29]:
# next_token = predict_next_token("the fear of school")
next_token = predict_next_token("today in new york")
next_token



array([1442], dtype=int64)

In [30]:
next_word = tokenizer.sequences_to_texts([next_token])
next_word

['adds']
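
Conceptually, the prediction is the index of the largest softmax probability, mapped back to a word through the reverse vocabulary (the Keras tokenizer keeps such a mapping in `index_word`). A toy sketch with a made-up four-entry distribution and a hypothetical reverse dictionary:

```python
toy_index_word = {1: "the", 2: "plan", 3: "adds"}  # hypothetical reverse vocabulary
toy_probs = [0.05, 0.20, 0.15, 0.60]               # position 0 is the padding slot

# Greedy choice: pick the index with the highest probability.
predicted_index = max(range(len(toy_probs)), key=toy_probs.__getitem__)
predicted_word = toy_index_word.get(predicted_index, "")
print(predicted_index, predicted_word)
```

Because the choice is always the single most likely word, the same seed text always produces the same continuation.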

### Generating New Headlines

We will reuse the previous function, calling it repeatedly so that it predicts more than one word.

In [31]:
def headlines_generator(sample_texts, next_words = 1):
    for i in range(next_words):
        next_token = predict_next_token(sample_texts)
        next_word = tokenizer.sequences_to_texts([next_token])[0]
        sample_texts += " "+ next_word
    return sample_texts.title()
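
The generator is an autoregressive loop: each predicted word is appended to the text, and the extended text becomes the input for the next prediction. The sketch below shows only that loop structure. `TOY_NEXT_WORD` and `toy_predict_next_word` are hypothetical stand-ins for the trained model, so the output is illustrative rather than a real prediction.

```python
# Hypothetical stand-in for the trained model: maps the last word to a next word.
TOY_NEXT_WORD = {"york": "adds", "adds": "trump", "trump": "stamps"}

def toy_predict_next_word(text):
    last_word = text.split()[-1]
    return TOY_NEXT_WORD.get(last_word, "the")

def toy_generate(seed_text, next_words=1):
    text = seed_text
    for _ in range(next_words):
        # Each predicted word is appended and becomes part of the next input.
        text += " " + toy_predict_next_word(text)
    return text

toy_headline = toy_generate("today in new york", next_words=3)
print(toy_headline)
```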

In [32]:
sample_texts = ['the fear of school', 'yesterday was a', 'washington dc is', 'today in new york', 'the school district has', 'crime has become', 'in recent news', 'trump has demanded', 'violence is not', 'sports and education must', 'music has become']

In [33]:
for texts in sample_texts:
    print(headlines_generator(texts, next_words = 6))

The Fear Of School Honesty Gets Demolition Seems To The
Yesterday Was A ‘Regret Clause’ A Story And The
Washington Dc Is The New York Hotel A Last
Today In New York Adds Trump Stamps And A Lot
The School District Has The Grid And Luxurious Spending District
Crime Has Become A Pawn To A Lift Israeli
In Recent News And Memories To Cover Homeless To
Trump Has Demanded Access Of ‘Roseanne’ And Answers To
Violence Is Not To Confront A Appetite Of Populism
Sports And Education Must Don’T Don’T Don’T Be Easy Books
Music Has Become A Cliff War Of A Changing
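
The `Don’T` in the output comes from `str.title()`, which starts a new capitalized word after every non-letter character, including the apostrophe. One possible alternative, sketched here with a hypothetical `headline_case` helper, is to capitalize each space-separated word instead.

```python
def headline_case(text):
    """Capitalize only the first character of each space-separated word."""
    return " ".join(word.capitalize() for word in text.split(" "))

raw = "sports and education must don’t be easy"
with_title = raw.title()
with_capitalize = headline_case(raw)
print(with_title)
print(with_capitalize)
```

The trade-off is that a word beginning with a quotation mark, such as `‘regret`, stays lowercase under this approach, because its first character is not a letter.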
