# Headline Generation (NLP)
***
![Newspaper](Images/HeadlineImage.jpg)
***
In this tutorial we use a recurrent neural network to learn from newspaper headlines and then generate new headlines from a seed (the first few words) that we give it. The data we have given you contains only American headlines, so the model will be biased. We suggest trying your own data if you find something similar!
***
## Imports

In [1]:
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.models import Sequential
import keras.utils as ku

from tensorflow import set_random_seed
from numpy.random import seed
set_random_seed(2)
seed(1)

import pandas as pd
import numpy as np
import string, os

Using TensorFlow backend.


## Load the Data
The data folder contains both article files and comment files. We only want the data associated with articles, not the comments, so we load only the files with `Articles` in their name.

To Do:
- Load all **article** data
- Take a look at the headlines that we have loaded

In [2]:
!ls data

ArticlesApril2017.csv ArticlesMarch2017.csv CommentsFeb2018.csv
ArticlesApril2018.csv ArticlesMarch2018.csv CommentsJan2017.csv
ArticlesFeb2017.csv   ArticlesMay2017.csv   CommentsJan2018.csv
ArticlesFeb2018.csv   CommentsApril2017.csv CommentsMarch2017.csv
ArticlesJan2017.csv   CommentsApril2018.csv CommentsMarch2018.csv
ArticlesJan2018.csv   CommentsFeb2017.csv   CommentsMay2017.csv


In [3]:
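curr_dir = 'data/'
all_headlines = []
for filename in os.listdir(curr_dir):
    # Only read the article files; skip the comment files
    if 'Articles' in filename:
        article_df = pd.read_csv(os.path.join(curr_dir, filename))
        all_headlines.extend(list(article_df.headline.values))
        # Note: this break stops after the first Articles file found;
        # remove it to load every Articles file (training will take longer)
        break

all_headlines[:10]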

['N.F.L. vs. Politics Has Been Battle All Season Long',
 'Voice. Vice. Veracity.',
 'A Stand-Up\xe2\x80\x99s Downward Slide',
 'New York Today: A Groundhog Has Her Day',
 'A Swimmer\xe2\x80\x99s Communion With the Ocean',
 'Trail Activity',
 'Super Bowl',
 'Trump\xe2\x80\x99s Mexican Shakedown',
 'Pence\xe2\x80\x99s Presidential Pet',
 'Fruit of a Poison Tree']
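
The To Do above asks for all of the article data. A minimal sketch of one way to collect the headlines from every `Articles*.csv` file is shown below. It uses the standard-library `csv` and `glob` modules rather than pandas, the helper name `load_all_headlines` is hypothetical, and it assumes every file has a `headline` column and is UTF-8 encoded.

```python
import csv
import glob
import os


def load_all_headlines(data_dir):
    """Collect the 'headline' column from every Articles*.csv file in data_dir."""
    headlines = []
    # sorted() makes the file order deterministic across machines
    for path in sorted(glob.glob(os.path.join(data_dir, "Articles*.csv"))):
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                headlines.append(row["headline"])
    return headlines
```

Calling `load_all_headlines('data/')` would then return one list with the headlines of all article files, leaving the comment files untouched.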

## Clean the Data
We need to clean the text because it contains non-ASCII characters, which is why apostrophes appear as "\xe2\x80\x99" (the UTF-8 bytes of a curly apostrophe). We will define a function that we can pass all of our text through and which returns the text without any punctuation, capital letters or non-ASCII characters.

To Do:
- Write a function which checks for punctuation, removes it and changes all letters to lower case
- Pass your data through the function

In [None]:
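def clean_text(txt):
    # Remove punctuation and convert to lower case
    txt = "".join(v for v in txt if v not in string.punctuation).lower()
    # Drop non-ASCII characters such as curly apostrophes.
    # Python 3 strings have no .decode(), so handle both bytes and str.
    if isinstance(txt, bytes):
        return txt.decode("ascii", "ignore")
    return txt.encode("ascii", "ignore").decode("ascii")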

corpus = [clean_text(x) for x in all_headlines]
corpus[:10]
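
To see where `\xe2\x80\x99` comes from: those three bytes are the UTF-8 encoding of the right single quotation mark (U+2019), the curly apostrophe. The sketch below is written for Python 3 and uses one headline from the output above. It decodes the bytes and then shows the effect of the `"ignore"` option used in the cleaning step, which drops any character that is not ASCII.

```python
# One of the headlines above, written out as raw UTF-8 bytes
raw = b"A Stand-Up\xe2\x80\x99s Downward Slide"

# Decoding as UTF-8 turns the three bytes into a single character
text = raw.decode("utf-8")
apostrophe = text[10]
code_point = hex(ord(apostrophe))

# Encoding to ASCII with errors="ignore" silently drops that character
ascii_only = text.encode("ascii", "ignore").decode("ascii")

print(code_point)  # 0x2019
print(ascii_only)  # A Stand-Ups Downward Slide
```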

## Tokenise
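In the Twitter Classification tutorial we tokenised our words; we will do the same here. The tokeniser builds a vocabulary, similar to a bag of words: it counts the number of occurrences of each word and labels each word with a unique integer identifier. We then convert every headline into a sequence of these identifiers and split it into n-gram prefixes (the first two words, the first three words, and so on), so the model can learn to predict the next word from the words before it. Finally, we pad the sequences to the same length and use the last token of each one as the label.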

To Do:
- Define a function to get a sequence of tokens
- Define a function to generate padded sequences

In [None]:
tokenizer = Tokenizer()

def get_sequence_of_tokens(corpus):
    ## tokenisation
    tokenizer.fit_on_texts(corpus)
    total_words = len(tokenizer.word_index)+1
    
    ## convert data to sequence of tokens
    input_sequences = []
    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)
    return input_sequences, total_words

input_sequences, total_words = get_sequence_of_tokens(corpus)
input_sequences[:10]
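
To make the n-gram step easier to follow, here is a small pure-Python sketch of the same idea on a toy corpus. The helpers `build_word_index` and `prefix_sequences` are hypothetical stand-ins written for illustration; they are not the Keras `Tokenizer`, although they follow the same convention of giving the most frequent word index 1 and leaving 0 free for padding.

```python
from collections import Counter


def build_word_index(corpus):
    """Map each word to an integer, most frequent first; 0 is left free for padding."""
    counts = Counter(word for line in corpus for word in line.split())
    # sorted() is stable, so equally frequent words keep first-seen order
    ranked = sorted(counts, key=lambda word: -counts[word])
    return {word: rank + 1 for rank, word in enumerate(ranked)}


def prefix_sequences(corpus, word_index):
    """Turn every headline into all of its prefixes of length >= 2."""
    sequences = []
    for line in corpus:
        tokens = [word_index[word] for word in line.split()]
        for i in range(1, len(tokens)):
            sequences.append(tokens[: i + 1])
    return sequences


toy_corpus = ["super bowl", "super bowl ads", "trail activity"]
toy_index = build_word_index(toy_corpus)
toy_sequences = prefix_sequences(toy_corpus, toy_index)
toy_total_words = len(toy_index) + 1  # +1 for the reserved padding index 0

print(toy_index)      # {'super': 1, 'bowl': 2, 'ads': 3, 'trail': 4, 'activity': 5}
print(toy_sequences)  # [[1, 2], [1, 2], [1, 2, 3], [4, 5]]
```

Each prefix becomes one training example: all tokens except the last are the input, and the last token is the word to predict.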

In [None]:
def generate_padded_sequences(input_sequences):
    max_sequence_len = max([len(x) for x in input_sequences])
    input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
    
    predictors, label = input_sequences[:,:-1], input_sequences[:,-1]
    label = ku.to_categorical(label, num_classes=total_words)
    return predictors, label, max_sequence_len

predictors, label, max_sequence_len = generate_padded_sequences(input_sequences)
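
The padding and label split can be sketched on a toy example as well. The functions `pad_pre` and `one_hot` below are hypothetical pure-Python stand-ins for `pad_sequences(..., padding='pre')` and `to_categorical`; they only cover the case used here, where no sequence is longer than `maxlen`.

```python
def pad_pre(sequences, maxlen):
    """Left-pad every sequence with zeros (assumes len(seq) <= maxlen)."""
    return [[0] * (maxlen - len(seq)) + seq for seq in sequences]


def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


toy_sequences = [[1, 2], [1, 2, 3], [4, 5]]
toy_total_words = 6  # five words plus the padding index 0

toy_max_len = max(len(seq) for seq in toy_sequences)
padded = pad_pre(toy_sequences, toy_max_len)

# Everything except the last token is the input; the last token is the label
toy_predictors = [row[:-1] for row in padded]
toy_labels = [one_hot(row[-1], toy_total_words) for row in padded]

print(padded)          # [[0, 1, 2], [1, 2, 3], [0, 4, 5]]
print(toy_predictors)  # [[0, 1], [1, 2], [0, 4]]
```

Padding on the left keeps the most recent words next to the label, which is why the input length of the model is `max_sequence_len - 1`.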

## Create the Model and Train
We're going to create a model using a Long Short-Term Memory (LSTM) layer. Traditional feed-forward neural networks treat every input independently and keep no memory of what they saw earlier in a sequence. Recurrent Neural Networks (RNNs) are different: they carry a hidden state from one step of the sequence to the next, so earlier words can influence later predictions. A typical RNN can struggle to identify and learn the long-term dependencies in the data. This is where an LSTM comes in, as it is capable of learning long-term dependencies.

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
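
To make the idea of long-term memory concrete, here is a minimal sketch of a single LSTM cell with scalar inputs and states, following the gate equations described in the post linked above. The weights are hand-picked, hypothetical values (not learned), chosen so that the cell stores the first input and keeps it while the later inputs are zero.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, weights):
    """One step of a scalar LSTM cell; weights maps gate name -> (w_x, w_h, bias)."""
    def gate(name, activation):
        w_x, w_h, bias = weights[name]
        return activation(w_x * x + w_h * h_prev + bias)

    forget = gate("forget", sigmoid)          # how much old memory to keep
    write = gate("input", sigmoid)            # how much new information to store
    output = gate("output", sigmoid)          # how much memory to expose
    candidate = gate("candidate", math.tanh)  # the new information itself

    c = forget * c_prev + write * candidate   # updated cell state (long-term memory)
    h = output * math.tanh(c)                 # updated hidden state
    return h, c


# Hypothetical weights: always keep memory, only write when the input is non-zero
weights = {
    "forget": (0.0, 0.0, 5.0),
    "input": (10.0, 0.0, -5.0),
    "output": (0.0, 0.0, 5.0),
    "candidate": (2.0, 0.0, 0.0),
}

h, c = 0.0, 0.0
cell_states = []
for x in [1.0, 0.0, 0.0, 0.0, 0.0]:
    h, c = lstm_step(x, h, c, weights)
    cell_states.append(c)

print(cell_states)
```

The cell state written at the first step decays only slightly over the following steps, which is the behaviour that lets an LSTM relate a word to something seen much earlier in the sequence.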

To Do:
- Create a model with the following layers:
    - Embedding
    - LSTM
    - Dense
- Compile the model
- Train the model

In [None]:
def create_model(max_sequence_len, total_words):
    input_len = max_sequence_len - 1
    model = Sequential()
    
    # Add input embedding layer
    model.add(Embedding(total_words, 10, input_length=input_len))
    
    # Add hidden layer 1 - LSTM layer
    model.add(LSTM(100))
    
    # Add output layer 
    model.add(Dense(total_words, activation='softmax'))
    
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    
    return model

model = create_model(max_sequence_len, total_words)
model.summary()
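
The parameter counts reported by `model.summary()` can be checked by hand. The sketch below reproduces the arithmetic for the three layers above, assuming the standard Keras layouts (an LSTM holds four sets of input weights, recurrent weights and biases). The vocabulary size of 2000 is a hypothetical stand-in for `total_words`, which depends on the data you loaded.

```python
vocab_size = 2000     # hypothetical value standing in for total_words
embedding_dim = 10    # second argument of Embedding above
lstm_units = 100      # argument of LSTM above

# Embedding: one vector of length embedding_dim per word in the vocabulary
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * lstm_units * (embedding_dim + lstm_units + 1)

# Dense: one weight per (LSTM unit, word) pair plus one bias per word
dense_params = (lstm_units + 1) * vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 20000 44400 202000 266400
```

Note that the LSTM count does not depend on the vocabulary size, while the Embedding and Dense layers both grow linearly with it.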

In [None]:
model.fit(predictors, label, epochs=100, verbose=2)

## Generate your Headlines
To generate the headlines we are going to create a function which takes in the beginning of our headline (the seed text), the number of words we want to add to it, the model we trained and the maximum sequence length used during training.

To Do:
- Create a generate_text function
- Generate different headlines 

In [None]:
def generate_text(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
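        # predict_classes was removed from recent Keras releases, so take
        # the argmax of the predicted probabilities instead
        predicted = int(np.argmax(model.predict(token_list, verbose=0), axis=-1)[0])
        
        # Look up the word whose index matches the predicted class
        output_word = ""
        for word, index in tokenizer.word_index.items():
            if index == predicted:
                output_word = word
                break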
        seed_text += " "+output_word
    return seed_text.title()
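
The generation loop above is a greedy decoder: at every step it picks the single most probable next word and appends it to the text. The pure-Python sketch below shows the same loop without Keras. The names `greedy_generate` and `toy_next_word_probs` are hypothetical, and the toy predictor is a hand-written stand-in for `model.predict`, not a trained model.

```python
def greedy_generate(seed_text, next_words, predict_probs, word_index, input_len):
    """Repeatedly append the most probable next word to the seed text."""
    index_word = {index: word for word, index in word_index.items()}
    words = seed_text.lower().split()
    for _ in range(next_words):
        # Words outside the vocabulary are skipped
        tokens = [word_index[w] for w in words if w in word_index]
        # Keep the most recent tokens and left-pad with zeros
        tokens = tokens[-input_len:]
        padded = [0] * (input_len - len(tokens)) + tokens
        probs = predict_probs(padded)
        # Greedy choice: index of the highest probability
        best = max(range(len(probs)), key=probs.__getitem__)
        words.append(index_word.get(best, ""))
    return " ".join(words).title()


toy_word_index = {"super": 1, "bowl": 2, "ads": 3, "return": 4}


def toy_next_word_probs(padded):
    """Stand-in for a model: always predicts the word after the last token."""
    probs = [0.0] * (len(toy_word_index) + 1)
    probs[padded[-1] % len(toy_word_index) + 1] = 1.0
    return probs


result = greedy_generate("super", 2, toy_next_word_probs, toy_word_index, input_len=3)
print(result)  # Super Bowl Ads
```

Because the choice is always the top-scoring word, the same seed always produces the same headline for a given trained model.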

In [None]:
print(generate_text("united states", 5, model, max_sequence_len))
print(generate_text("president trump", 4, model, max_sequence_len))
print(generate_text("donald trump", 4, model, max_sequence_len))
print(generate_text("india and china", 4, model, max_sequence_len))
print(generate_text("new york", 4, model, max_sequence_len))
print(generate_text("science and technology", 5, model, max_sequence_len))
print(generate_text("fake news", 5, model, max_sequence_len))