# Word Prediction
## Build and train a model to perform word prediction

## Reading in the Data
Our dataset consists of headlines from the New York Times newspaper over the course of several months. We'll start by reading in all the headlines from the articles. The articles are in CSV files, so we can use pandas to read them in.

In [1]:
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras import utils
import pandas as pd
import numpy as np
import os 

In [2]:
nyt_dir = 'data/nyt_dataset/articles/'

all_headlines = []
for filename in os.listdir(nyt_dir):
    if 'Articles' in filename:
        # Read in all the data from the CSV file
        headlines_df = pd.read_csv(nyt_dir + filename)
        # Add all of the headlines to our list
        all_headlines.extend(list(headlines_df.headline.values))
len(all_headlines)

9335

In [3]:
all_headlines[:20]

['My Beijing: The Sacred City',
 '6 Million Riders a Day, 1930s Technology',
 'Seeking a Cross-Border Conference',
 'Questions for: ‘Despite the “Yuck Factor,” Leeches Are Big in Russian Medicine’',
 'Who Is a ‘Criminal’?',
 'An Antidote to Europe’s Populism',
 'The Cost of a Speech',
 'Degradation of the Language',
 'On the Power of Being Awful',
 'Trump Garbles Pitch on a Revised Health Bill',
 'What’s Going On in This Picture? | May 1, 2017',
 'Unknown',
 'When Patients Hit a Medical Wall',
 'Unknown',
 'For Pregnant Women, Getting Serious About Whooping Cough',
 'Unknown',
 'New York City Transit Reporter in Wonderland: Riding the London Tube',
 'How to Cut an Avocado Without Cutting Yourself',
 'In Fictional Suicide, Health Experts Say They See a Real Cause for Alarm',
 'Claims of Liberal Media Bias Hit ESPN, Too']

## Cleaning the Data
1. Remove all headlines with the value "Unknown".
2. Remove punctuation and convert all text to lower case (the Keras `Tokenizer` does both by default). This makes the model easier to train: for our purposes, it makes little or no difference whether a line ends with "!" or "?", or whether words are capitalized.
3. Tokenize the text:

    a. Separate each piece of text into smaller chunks (tokens), which in this case are words.

    b. Represent each word that appears in our dataset with a number.
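
The cleaning and tokenization steps above can be sketched in plain Python. This is only an illustration of the idea, not the Keras `Tokenizer` implementation: `simple_tokenize` and `build_word_index` are hypothetical helper names, and stripping every character in `string.punctuation` is a simplifying assumption (the Keras default filter list is slightly different).

```python
import string
from collections import Counter

def simple_tokenize(text):
    """Lowercase the text, replace ASCII punctuation with spaces, and split into words."""
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return text.lower().translate(table).split()

def build_word_index(texts):
    """Map each word to an integer, giving the most frequent words the smallest numbers."""
    counts = Counter(word for text in texts for word in simple_tokenize(text))
    # sorted() is stable, so words with equal counts keep their first-seen order
    ranked = sorted(counts, key=lambda word: -counts[word])
    # Start at 1 so that 0 stays free for padding
    return {word: index for index, word in enumerate(ranked, start=1)}

sample_headlines = ["The Cost of a Speech", "On the Power of Being Awful!"]
word_index = build_word_index(sample_headlines)
print(word_index)

# Convert a headline into its sequence of integers
encoded = [word_index[word] for word in simple_tokenize("The Power of a Speech")]
print(encoded)
```

Starting the numbering at 1 mirrors the `+ 1` in `total_words = len(tokenizer.word_index) + 1`: index 0 is reserved and never assigned to a word.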



In [4]:
# Remove all headlines with the value of "Unknown"
all_headlines = [h for h in all_headlines if h != "Unknown"]
len(all_headlines)

# Tokenize the words in our headlines
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_headlines)
total_words = len(tokenizer.word_index) + 1
print('Total words: ', total_words)

Total words:  11753


In [5]:
# Print a subset of the word_index dictionary created by Tokenizer
subset_dict = {key: value for key, value in tokenizer.word_index.items() \
               if key in ['a','man','a','plan','a','canal','panama']}
print(subset_dict)

{'a': 2, 'plan': 82, 'man': 139, 'panama': 2732, 'canal': 7047}


In [6]:
# See how the tokenizer saves the words:
tokenizer.texts_to_sequences(['a','man','a','plan','a','canal','panama'])

[[2], [139], [2], [82], [2], [7047], [2732]]

## Creating Sequences
Now that we've tokenized the data, we will create sequences of tokens from the headlines. 

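These sequences are what we will train our deep learning model on.

For example, take the headline "nvidia launches ray tracing gpus", and suppose each word is replaced by its token number:

nvidia - 5, launches - 22, ray - 94, tracing - 16, gpus - 102.

The full sequence would be: [5, 22, 94, 16, 102]. It is also valid to train on the shorter sequences within that headline, such as "nvidia launches", so below we split each headline into every prefix of two or more words.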

In [7]:
# Convert data to sequence of tokens 
input_sequences = []
for line in all_headlines:
    # Convert our headline into a sequence of tokens
    token_list = tokenizer.texts_to_sequences([line])[0]
    
    # Create a series of sequences for each headline
    for i in range(1, len(token_list)):
        partial_sequence = token_list[:i+1]
        input_sequences.append(partial_sequence)

print(tokenizer.sequences_to_texts(input_sequences[:5]))
input_sequences[:5]

['my beijing', 'my beijing the', 'my beijing the sacred', 'my beijing the sacred city', '6 million']


[[52, 1616],
 [52, 1616, 1],
 [52, 1616, 1, 1992],
 [52, 1616, 1, 1992, 125],
 [126, 346]]

## Padding Sequences
Right now our sequences are of various lengths.

For our model to be able to train on the data, we need to make all the sequences the same length.

In [8]:
# Determine max sequence length
max_sequence_len = max([len(x) for x in input_sequences])

# Pad all sequences with zeros at the beginning to make them all max length
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
input_sequences[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,   52, 1616], dtype=int32)
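
To see why the zeros go at the beginning, here is a minimal plain-Python sketch of pre-padding. The helper `pad_pre` is a hypothetical stand-in written for illustration, not the Keras `pad_sequences` function, and it assumes `maxlen` is at least 1.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad (or left-truncate) a list of token ids to exactly maxlen items."""
    # If the sequence is too long, keep only its last maxlen tokens
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

sequences = [[52, 1616], [52, 1616, 1], [52, 1616, 1, 1992]]
maxlen = max(len(s) for s in sequences)
padded = [pad_pre(s, maxlen) for s in sequences]
for row in padded:
    print(row)

# With pre-padding, the last column always holds the most recent word
last_tokens = [row[-1] for row in padded]
print(last_tokens)
```

Because the padding sits on the left, the final position of every row is a real word, which is what lets us slice off the last column as the label in the next step.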

## Creating Predictors and Target
Split up the sequences into predictors and a target.

The last word of the sequence will be our target, and the first words of the sequence will be our predictors.

As an example, take a look at the full headline: "nvidia releases ampere graphics cards"
<table>
<tr><td>PREDICTORS </td> <td>           TARGET </td></tr>
<tr><td>nvidia                   </td> <td>  releases </td></tr>
<tr><td>nvidia releases               </td> <td>  ampere </td></tr>
<tr><td>nvidia releases ampere      </td> <td>  graphics</td></tr>
<tr><td>nvidia releases ampere graphics </td> <td>  cards</td></tr>
</table>
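
The rows of this table can be reproduced with a few lines of plain Python. This is a sketch on the raw words rather than on the padded token arrays used in the notebook; the names `words` and `pairs` are chosen here for illustration.

```python
headline = "nvidia releases ampere graphics cards"
words = headline.split()

# Every prefix of the headline predicts the word that follows it
pairs = [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]

for context, target in pairs:
    print(f"{context!r} -> {target!r}")
```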

In [9]:
# Predictors are every word except the last
predictors = input_sequences[:,:-1]
# Labels are the last word
labels = input_sequences[:,-1]
labels[:5]

array([1616,    1, 1992,  125,  346], dtype=int32)

In [10]:
# Like our earlier sections, these targets are categorical.
# We are predicting one word out of our possible total vocabulary. 
# Instead of the network predicting scalar numbers, we will have it predict binary categories.
labels = utils.to_categorical(labels, num_classes=total_words)
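
As a rough picture of what this conversion produces, the sketch below one-hot encodes a few toy labels in plain Python. The `one_hot` helper and the five-class vocabulary are assumptions made for the example; they are not part of Keras.

```python
def one_hot(label, num_classes):
    """Return a vector of zeros with a single 1.0 at the position given by label."""
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector

toy_labels = [2, 0, 3]
encoded = [one_hot(label, num_classes=5) for label in toy_labels]
for label, row in zip(toy_labels, encoded):
    print(label, row)
```

In the notebook each row has `total_words` entries instead of five, so the encoded labels take far more memory than the integer labels. Keras also provides a `sparse_categorical_crossentropy` loss that accepts integer labels directly, which is an alternative when memory is a concern.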

## Creating the Model
We will use two new layer types to deal with our sequential data:
1. Embedding layer - takes the tokenized sequences and learns an embedding for every word in the training dataset. Mathematically, embeddings work much like the neurons of a neural network layer, but conceptually their goal is to reduce the number of dimensions for some or all of the features. In this case, the layer represents each word as a vector, and the information within that vector captures the relationships between words.
2. LSTM layer - a long short-term memory (LSTM) layer is a type of recurrent neural network (RNN). Unlike the traditional feed-forward networks we've seen so far, recurrent networks have loops in them, allowing information to persist. New information (x) is passed into the network, which outputs a prediction (h). In addition, information from that step is saved and used as input for the next prediction. Viewed unrolled over time, each new piece of data (x) fed into the network produces a prediction (h) and also passes some information along to the next step, so that step receives a new piece of data but also learns from the step before it. In traditional RNNs, more recent information contributes more than information from further back. LSTMs are a special type of recurrent layer that can learn and retain longer-term information.
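
The idea of a loop that carries information forward can be sketched with a single scalar recurrent cell. Note that this is a plain RNN step, not an LSTM, and the weights `w_x` and `w_h` are arbitrary values picked for illustration rather than learned ones.

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One step of a scalar recurrent cell: mix the new input with the previous state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

# A single non-zero input followed by zeros
inputs = [1.0, 0.0, 0.0, 0.0]

h = 0.0  # initial state: nothing remembered yet
states = []
for x in inputs:
    h = rnn_step(x, h)  # the previous state feeds back into the next step
    states.append(h)

print([round(s, 3) for s in states])
```

The state stays non-zero after the input has gone quiet, so information does persist, but it shrinks at every step. That fading is the weakness of a plain recurrent cell that LSTM layers are designed to reduce.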

In [11]:
# Input is max sequence length - 1, as we've removed the last word for the label
input_len = max_sequence_len - 1 

model = Sequential()

# Add input embedding layer
model.add(Embedding(total_words, 10, input_length=input_len))

# Add LSTM layer with 100 units
model.add(LSTM(100))
model.add(Dropout(0.1))

# Add output layer
model.add(Dense(total_words, activation='softmax'))

In [12]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 27, 10)            117530    
_________________________________________________________________
lstm (LSTM)                  (None, 100)               44400     
_________________________________________________________________
dropout (Dropout)            (None, 100)               0         
_________________________________________________________________
dense (Dense)                (None, 11753)             1187053   
Total params: 1,348,983
Trainable params: 1,348,983
Non-trainable params: 0
_________________________________________________________________
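
The parameter counts in this summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers with the sizes from this notebook; the variable names are chosen here for illustration.

```python
vocab_size = 11753  # total_words reported by the tokenizer
embed_dim = 10      # size of each word vector
lstm_units = 100

# One vector of embed_dim weights per word in the vocabulary
embedding_params = vocab_size * embed_dim

# An LSTM has three gates plus a candidate cell update; each of the four
# has input weights, recurrent weights, and a bias per unit
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# One weight per (LSTM unit, word) pair plus one bias per word
dense_params = lstm_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 117530 44400 1187053 1348983
```

Most of the parameters sit in the output layer, because it needs a full set of weights for every word in the vocabulary.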


In [13]:
model.compile(loss='categorical_crossentropy', optimizer='adam')
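
For intuition about what `categorical_crossentropy` measures, here is a minimal plain-Python sketch for a single example. It is an illustration only: a production implementation also guards against `log(0)` and averages over a batch, and the four-word vocabulary and probabilities below are made up.

```python
import math

def categorical_crossentropy(one_hot_target, predicted_probs):
    """Negative log of the probability the model assigned to the correct class."""
    return -sum(
        t * math.log(p)
        for t, p in zip(one_hot_target, predicted_probs)
        if t > 0
    )

target = [0, 0, 1, 0]  # the correct next word is class 2

confident = [0.05, 0.05, 0.85, 0.05]  # most probability on the right word
unsure = [0.25, 0.25, 0.25, 0.25]     # no preference at all

loss_confident = categorical_crossentropy(target, confident)
loss_unsure = categorical_crossentropy(target, unsure)
print(round(loss_confident, 4), round(loss_unsure, 4))
```

The loss falls as the model puts more probability on the correct next word, which is the quantity the optimizer pushes down during training.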

## Training the Model
Fit the model.

30 epochs will take a few minutes.

We don't track a training or validation accuracy score because this is a text-prediction problem, where many different next words can be plausible.

In [None]:
model.fit(predictors, labels, epochs=30, verbose=1)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30

## Making Predictions
1. Start with a seed text
2. Prepare it in the same way we prepared our dataset (tokenizing and padding).

In [None]:
def predict_next_token(seed_text):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
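    # Take the index of the most probable word.
    # (model.predict_classes was removed in newer TensorFlow releases.)
    prediction = np.argmax(model.predict(token_list, verbose=0), axis=-1)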
    return prediction

In [None]:
prediction = predict_next_token("today in new york")
prediction

In [None]:
# Use our tokenizer to decode the predicted word:
tokenizer.sequences_to_texts([prediction])

## Generate New Headlines
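Now that we can predict a single word, we can generate headlines of more than one word by feeding each prediction back in as part of the seed text.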

In [None]:
# Creates a new headline of arbitrary length.
def generate_headline(seed_text, next_words=1):
    for _ in range(next_words):
        # Predict next token
        prediction = predict_next_token(seed_text)
        # Convert token to word
        next_word = tokenizer.sequences_to_texts([prediction])[0]
        # Add next word to the headline. This headline will be used in the next pass of the loop.
        seed_text += " " + next_word
    # Return headline as title-case
    return seed_text.title()

In [None]:
# Try some headlines!
seed_texts = [
    'washington dc is',
    'today in new york',
    'the school district has',
    'crime has become']
for seed in seed_texts:
    print(generate_headline(seed, next_words=5))

### Conclusions:
1. Most of the headlines make some kind of grammatical sense, but don't necessarily indicate a good contextual understanding.
2. Training for more epochs may improve the results.
3. Another possible improvement is to use pretrained embeddings such as Word2Vec or GloVe, rather than learning them during training as we did with the Keras Embedding layer.
4. NLP has moved beyond simple LSTM models to Transformer-based pre-trained models, which are able to learn language context from huge amounts of textual data such as Wikipedia. 