# Recurrent Neural Networks

Recurrent neural networks (RNNs) process sequential data while retaining a memory of earlier elements in the sequence. They can be used to predict stock prices, the next word in a sentence, and sensor measurements.

Recurrent means that the output at the current time step becomes part of the input at the next time step. In other words, at each time step the network considers not only its new input but also its own output from the previous time step. This memory lets the network take context into account and learn dependencies between elements of a sequence.
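
To make the recurrence concrete, here is a minimal sketch of a single-unit recurrent cell in plain Python. The weights `w_x`, `w_h`, and `b` are arbitrary illustrative values, not learned ones, and the function names are made up for this example.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One time step: combine the new input with the previous output."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Feed a sequence through the cell, carrying the state forward."""
    h = 0.0  # initial state: nothing remembered yet
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)  # the previous output re-enters here
        states.append(h)
    return states

# Only the first input is non-zero, yet later states are still non-zero:
# the cell "remembers" the first input through its recurrent connection.
states = run_rnn([1.0, 0.0, 0.0])
print(states)
```

The state fades over the later steps because nothing new arrives, which illustrates how the memory is carried purely by the recurrent connection.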

![title](1_NKhwsOYNUT5xU7Pyf6Znhg.png)

# LSTM

Long short-term memory (LSTM) is a popular type of recurrent neural network, and it is the one we use in this notebook. A defining feature of an LSTM is its "forget gate", which determines which parts of the internal state are remembered and which are forgotten.
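
As a rough illustration of what the forget gate does, the sketch below updates the cell state of a single LSTM unit. It is a simplification: in a real LSTM the gate values are computed from the current input and the previous output using learned weights, whereas here they are passed in directly as hypothetical `forget_logit` and `input_logit` arguments.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_cell_state(c_prev, forget_logit, input_logit, candidate):
    """Update one unit's cell state: keep part of the old state, write part of the new."""
    f = sigmoid(forget_logit)  # forget gate in (0, 1): share of c_prev to keep
    i = sigmoid(input_logit)   # input gate in (0, 1): share of the candidate to write
    return f * c_prev + i * math.tanh(candidate)

# Forget gate wide open: the old state is almost fully remembered.
kept = lstm_cell_state(c_prev=2.0, forget_logit=10.0, input_logit=-10.0, candidate=0.0)
# Forget gate closed: the old state is almost fully forgotten.
dropped = lstm_cell_state(c_prev=2.0, forget_logit=-10.0, input_logit=-10.0, candidate=0.0)
print(kept, dropped)
```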

# Generating News Headlines

We will be using articles published by The New York Times to train a text generation model, which we will use to generate news headlines.

The data can be found here: https://www.kaggle.com/aashita/nyt-comments/data.

We delete all files that start with "Comments", since we're only interested in the articles.

## Import Libraries

In [1]:
# keras module for building LSTM 
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.models import Sequential
import keras.utils as ku 

import pandas as pd
import numpy as np
import string, os, glob

import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)

Using TensorFlow backend.


## Load Data

Combine all "Articles" files into a single DataFrame and extract the headlines.

In [2]:
os.chdir(r"C:\Users\kaiaj_000\Documents\nyt-comments")

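# Collect every CSV left in the folder (only the "Articles" files remain)
all_filenames = glob.glob('*.csv')

# Read each file and stack them into one DataFrame
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])

# Keep the headlines, dropping the placeholder value "Unknown"
headlines = combined_csv["headline"].values
headlines = [h for h in headlines if h != "Unknown"]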

headlines[:10]

['Finding an Expansive View  of a Forgotten People in Niger',
 'And Now,  the Dreaded Trump Curse',
 'Venezuela’s Descent Into Dictatorship',
 'Stain Permeates Basketball Blue Blood',
 'Taking Things for Granted',
 'The Caged Beast Awakens',
 'An Ever-Unfolding Story',
 'O’Reilly Thrives as Settlements Add Up',
 'Mouse Infestation',
 'Divide in G.O.P. Now Threatens Trump Tax Plan']

## Data Cleaning

Remove punctuation, convert to lowercase, and drop non-ASCII characters.

In [3]:
def clean_text(txt):
    txt = "".join(v for v in txt if v not in string.punctuation).lower()
    txt = txt.encode("utf8").decode("ascii",'ignore')
    return txt 

corpus = [clean_text(x) for x in headlines]
corpus[:10]

['finding an expansive view  of a forgotten people in niger',
 'and now  the dreaded trump curse',
 'venezuelas descent into dictatorship',
 'stain permeates basketball blue blood',
 'taking things for granted',
 'the caged beast awakens',
 'an everunfolding story',
 'oreilly thrives as settlements add up',
 'mouse infestation',
 'divide in gop now threatens trump tax plan']

## Tokenization

Tokenization converts words into tokens: each distinct word is assigned an integer, which is its token.
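
The word-to-integer mapping can be sketched without Keras. The helpers below are a simplified stand-in for what `Tokenizer` does, assuming more frequent words get smaller indices and index 0 is reserved for padding. They are illustrative only and not the library's implementation.

```python
from collections import Counter

def build_word_index(texts):
    """Assign each distinct word an integer, most frequent word first."""
    counts = Counter(word for text in texts for word in text.split())
    # Start at 1 so that 0 stays free to act as the padding value.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word with its token; unknown words are skipped."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

toy_corpus = ["the caged beast awakens", "the dreaded trump curse"]
word_index = build_word_index(toy_corpus)
sequences = texts_to_sequences(toy_corpus, word_index)
print(word_index)
print(sequences)
```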

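After tokenizing, our article titles are sequences of tokens rather than sequences of words. The function below also expands each title into all of its prefixes of two or more tokens, so that every position in a title yields one "predict the next word" training example.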

In [4]:
tokenizer = Tokenizer()

def get_sequence_of_tokens(corpus):
    
    tokenizer.fit_on_texts(corpus)
    total_words = len(tokenizer.word_index) + 1
    
    input_sequences = []
    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)
            
    return input_sequences, total_words

inp_sequences, total_words = get_sequence_of_tokens(corpus)
inp_sequences[:10]

[[391, 17],
 [391, 17, 5166],
 [391, 17, 5166, 523],
 [391, 17, 5166, 523, 4],
 [391, 17, 5166, 523, 4, 2],
 [391, 17, 5166, 523, 4, 2, 1601],
 [391, 17, 5166, 523, 4, 2, 1601, 134],
 [391, 17, 5166, 523, 4, 2, 1601, 134, 5],
 [391, 17, 5166, 523, 4, 2, 1601, 134, 5, 1951],
 [7, 57]]

## Padding and Variables

Since our sequences have different lengths, we must pad them to the same length so that they can be fed into our model.

We must also give our model predictors and labels, so that it can learn which word to predict after a word or sequence of words. For each sequence, the last token is the label and the tokens before it are the predictors.

| Predictors | Label |
|---|---|
| they | are |
| they are | learning |
| they are learning | data |
| they are learning data | science |
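
The same split can be traced by hand on a tiny example. This sketch uses plain Python lists and made-up token ids for "they are learning data science". The helper names are illustrative, and `pad_pre` only mimics the effect of `pad_sequences(..., padding='pre')`.

```python
def prefix_sequences(tokens):
    """All prefixes of length two or more, one per 'next word' example."""
    return [tokens[:i + 1] for i in range(1, len(tokens))]

def pad_pre(seq, length, value=0):
    """Left-pad a sequence with `value` up to `length`."""
    return [value] * (length - len(seq)) + seq

tokens = [1, 2, 3, 4, 5]  # hypothetical ids: they=1, are=2, learning=3, data=4, science=5
prefixes = prefix_sequences(tokens)
max_len = max(len(p) for p in prefixes)
padded = [pad_pre(p, max_len) for p in prefixes]

# The last token of each padded row is the label, the rest are the predictors.
predictors = [row[:-1] for row in padded]
labels = [row[-1] for row in padded]
print(predictors)
print(labels)
```

Padding on the left keeps the most recent words next to the label, so the final positions of every predictor row always hold real context.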

In [5]:
def generate_padded_sequences(input_sequences):
    
    max_sequence_len = max([len(x) for x in input_sequences])
    input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
    
    predictors, label = input_sequences[:,:-1],input_sequences[:,-1]
    label = ku.to_categorical(label, num_classes=total_words)
    
    return predictors, label, max_sequence_len

predictors, label, max_sequence_len = generate_padded_sequences(inp_sequences)

## Building the LSTM

The architecture of our LSTM will be as follows:
1. Input Layer: Takes the sequence of tokens as input and maps each token to a 10-dimensional embedding vector.
2. LSTM Layer: Computes the output using 100 LSTM units.
3. Dropout Layer: A regularisation layer that randomly turns off a fraction (10% here) of the LSTM layer's activations during training, which helps prevent overfitting.
4. Output Layer: Computes, for every word in the vocabulary, the probability that it is the next word.

In [6]:
def create_model(max_sequence_len, total_words):
    input_len = max_sequence_len - 1
    model = Sequential()
    
    # Add Input Embedding Layer
    model.add(Embedding(total_words, 10, input_length=input_len))
    
    # Add Hidden Layer 1 - LSTM Layer
    model.add(LSTM(100))
    model.add(Dropout(0.1))
    
    # Add Output Layer
    model.add(Dense(total_words, activation='softmax'))

    model.compile(loss='categorical_crossentropy', optimizer='adam')
    
    return model

model = create_model(max_sequence_len, total_words)
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 23, 10)            112650    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               44400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 11265)             1137765   
Total params: 1,294,815
Trainable params: 1,294,815
Non-trainable params: 0
_________________________________________________________________
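
The parameter counts in the summary can be checked by hand. The sketch below assumes the usual formulas for each layer type: an embedding table, an LSTM with four weight sets each having input weights, recurrent weights, and a bias, and a dense layer with a bias. The helper names are illustrative and not part of Keras.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four weight sets (three gates plus the candidate state), each with
    # input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output
    return input_dim * output_dim + output_dim

vocab_size, embed_dim, units = 11265, 10, 100  # values taken from the summary above
counts = [
    embedding_params(vocab_size, embed_dim),
    lstm_params(embed_dim, units),
    dense_params(units, vocab_size),
]
total = sum(counts)
print(counts, total)
```

Almost all of the parameters sit in the output layer, because it needs one set of weights for every word in the vocabulary.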


In [7]:
model.fit(predictors, label, epochs=100, verbose=5)

Epoch 1/100
Epoch 2/100
...
Epoch 77/100
Epoch 78

<keras.callbacks.callbacks.History at 0x1cfd6ff128>

## Text Generation

We will write a function that predicts the next word from a sequence of input text. Each predicted word is appended to the input, and the process repeats until we have generated the requested number of words.
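
The predict-and-append loop can be sketched in isolation. Here `toy_predict_probs` is a hypothetical stand-in for the trained model: it returns one probability per vocabulary entry, and the loop always picks the most likely one (greedy decoding).

```python
def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

def greedy_generate(seed_tokens, n_words, predict_probs):
    """Repeatedly predict the most likely next token and append it."""
    tokens = list(seed_tokens)
    for _ in range(n_words):
        probs = predict_probs(tokens)  # one probability per vocabulary entry
        tokens.append(argmax(probs))   # the prediction becomes part of the next input
    return tokens

def toy_predict_probs(tokens, vocab_size=4):
    """Made-up model: always favours (last token + 1) modulo the vocabulary size."""
    favoured = (tokens[-1] + 1) % vocab_size
    return [0.7 if i == favoured else 0.1 for i in range(vocab_size)]

generated = greedy_generate([1], 3, toy_predict_probs)
print(generated)
```

Because the most likely word is always chosen, the same seed text yields the same output every time.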

In [8]:
def generate_text(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
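        # predict_classes was removed in newer Keras releases, so take the argmax of the predicted probabilities
        predicted = int(np.argmax(model.predict(token_list, verbose=0), axis=-1)[0])
        
        # Map the predicted index back to its word (index 0 is padding and has no word)
        output_word = tokenizer.index_word.get(predicted, "")
        seed_text += " "+output_word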
        
    return seed_text.title()

print (generate_text("united states", 5, model, max_sequence_len))
print (generate_text("preident trump", 4, model, max_sequence_len))
print (generate_text("donald trump", 4, model, max_sequence_len))
print (generate_text("india and china", 4, model, max_sequence_len))
print (generate_text("new york", 4, model, max_sequence_len))
print (generate_text("science and technology", 5, model, max_sequence_len))

United States Pushes Us Maps The Safety
Preident Trump Is A Racist Period
Donald Trump Vs The Food Snobs
India And China For Fine Cause Becomes
New York Today A Noreaster Nears
Science And Technology A First Pet And Urgent


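Ha ha! Some obvious nonsense is present here, but we could argue that at least the third headline makes some sense. The second headline also links President Trump with racism, an association that, rightly or wrongly, has been made in the past. (Note that the seed "preident" in the second example is misspelled. The tokenizer skips words it has not seen, so that prediction is most likely driven by "trump" alone.)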

## Improvement

Ideally, we would want to generate headlines that could fool someone into thinking they were written by a human, which is not the case for most of the headlines generated above.

Tuning the network architecture and hyperparameters is always a good suggestion, but in this case, the best way to improve performance would be to get more data.