# Sequence Data

In this notebook we will take a detour away from stand-alone pieces of data, such as still images, to data that depends on other items in a sequence. For our example, we will use text sentences. Each element in the data has a relationship with what comes before and what comes after, and this fact requires a different approach.

## Headline Generator

We have all seen text predictors in applications like search bars, cell phones, and text editors that autocomplete sentences. Many of the good text predictor models are trained on very large datasets and take a lot of time and/or processing power to train.

## Reading in the Data

Our dataset consists of headlines from the New York Times newspaper over the course of several months. We will start by reading in all the headlines from the articles. The articles are in CSV files, so we can use *pandas* to read them in.

In [1]:
import os
import pandas as pd

In [2]:
nyt_dir = 'Data/nyt_dataset/'

all_headlines = []
for filename in os.listdir(nyt_dir):
    if 'Articles' in filename:
        # Read in all the data from the CSV file
        headlines_df = pd.read_csv(nyt_dir + filename)
        # Add all of the headlines to our list
        all_headlines.extend(list(headlines_df.headline.values))
len(all_headlines)

9335

In [3]:
all_headlines[:20]

['I Stand  With the ‘She-Devils’',
 'Trump’s Birth Control Problems',
 'What’s the Craziest Thing You’ve Ever Found in a Xerox Machine?',
 'U.S. Allies’ Conflict Is ISIS’ Gain',
 '$1.5 Trillion Plan on Infrastructure, but Not a Lot of Funding or Details',
 'Mueller Zeros In on a Trump Tower Cover Story',
 ' With Speech, A ‘Dreamers’ Rift Deepens',
 'At the Start',
 '‘The Assassination of Gianni Versace’ Episode 3: Death or Disgrace?',
 'Britain’s Model for Outsourcing Services May Be Cracking',
 'Unknown',
 'Real Friends vs. Facebook Friends',
 'For Kurds and Allies in Syria, U.S. Vow of Support Eases Fears on Turkey',
 'Should Schools Teach You How to Be Happy?',
 'Jessica Williams and Phoebe Robinson, Onstage',
 'Autocrats Steamroll Opponents With No Objections From U.S.',
 '32 Billion',
 'A Volcanic Idea for Cooling the Earth',
 'After the man was treated for a bad infection, tests indicated he was getting better. So why was he feeling weaker by the day?',
 'Republicans Pack Campus 

## Cleaning the Data

An important part of natural language processing (NLP) tasks (where computers deal with language) is processing text in a way that computers can understand. We are going to take each of the words that appears in our dataset and represent it with a number. This will be part of a process called **tokenization**.

Before we do that, we need to make sure we have good data. There are some headlines that are listed as "Unknown". We do not want these items in our training set, so we will filter them out.

In [4]:
# Remove all headlines with the value of "Unknown"
all_headlines = [h for h in all_headlines if h != "Unknown"]
len(all_headlines)

8603

In [5]:
all_headlines[:20]

['I Stand  With the ‘She-Devils’',
 'Trump’s Birth Control Problems',
 'What’s the Craziest Thing You’ve Ever Found in a Xerox Machine?',
 'U.S. Allies’ Conflict Is ISIS’ Gain',
 '$1.5 Trillion Plan on Infrastructure, but Not a Lot of Funding or Details',
 'Mueller Zeros In on a Trump Tower Cover Story',
 ' With Speech, A ‘Dreamers’ Rift Deepens',
 'At the Start',
 '‘The Assassination of Gianni Versace’ Episode 3: Death or Disgrace?',
 'Britain’s Model for Outsourcing Services May Be Cracking',
 'Real Friends vs. Facebook Friends',
 'For Kurds and Allies in Syria, U.S. Vow of Support Eases Fears on Turkey',
 'Should Schools Teach You How to Be Happy?',
 'Jessica Williams and Phoebe Robinson, Onstage',
 'Autocrats Steamroll Opponents With No Objections From U.S.',
 '32 Billion',
 'A Volcanic Idea for Cooling the Earth',
 'After the man was treated for a bad infection, tests indicated he was getting better. So why was he feeling weaker by the day?',
 'Republicans Pack Campus Social Agend

We also want to remove punctuation and make our sentences all lower case, because this will make our model easier to train. For our purposes, there is little or no difference between a line ending with "!" or "?", or between a word that is capitalized, as in "The", and one that is lower case, as in "the". With fewer unique tokens, our model will be easier to train.

We could filter our sentences prior to tokenization, but we do not need to because this can all be done using the Keras `Tokenizer`.

## Tokenization

Right now, our dataset consists of a set of headlines, each made up of a series of words. We want to give our model a way of representing those words in a way that it can understand. With tokenization, we separate a piece of text into smaller chunks (tokens), which in this case are words. Each unique word is then assigned a number, as this is a way that our model can understand the data. Keras has a class that will help us tokenize our data:

```python
tf.keras.preprocessing.text.Tokenizer(
    num_words=None,
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
    lower=True,
    split=' ',
    char_level=False,
    oov_token=None,
    document_count=0,
    **kwargs
    )
```

Taking a look at the `Tokenizer` class in Keras, we see the default values are already set up for our use case. The `filters` string already removes punctuation and the `lower` flag sets words to lower case.
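To see what those defaults amount to, here is a minimal pure-Python sketch of the same idea: lower-case the text, replace the filtered characters with spaces, split on whitespace, and number the words from most to least frequent, starting at 1. The names `FILTERS`, `split_words`, and `build_word_index` are made up for this illustration, and this is not the Keras implementation.

```python
from collections import Counter

# Same characters as the default `filters` string shown above
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def split_words(text):
    """Lower-case the text, turn filtered characters into spaces, and split."""
    table = str.maketrans({ch: ' ' for ch in FILTERS})
    return text.lower().translate(table).split()

def build_word_index(texts):
    """Number words from most to least frequent, starting at 1 (0 stays unused)."""
    counts = Counter(word for text in texts for word in split_words(text))
    ranked = sorted(counts, key=counts.get, reverse=True)
    return {word: index for index, word in enumerate(ranked, start=1)}

sample = ['The plan: a canal!', 'A man, a plan.']
word_index = build_word_index(sample)
print(split_words(sample[0]))
print(word_index)
```

Note that the curly quotes used in our headlines (‘ and ’) are not part of the filter string, so they stay attached to the words they surround.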

In [6]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [7]:
# Tokenize the words in our headlines
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_headlines)
total_words = len(tokenizer.word_index) + 1
print("Total words: ", total_words)

Total words:  11753


We can take a look at the "word_index" dictionary to see how the tokenizer saves the words.

In [8]:
# Print a subset of the word_index dictionary created by Tokenizer
subset_dict = {key: value for key, value in tokenizer.word_index.items()\
              if key in ['a', 'man', 'a', 'plan', 'a', 'canal', 'panama']}
print(subset_dict)

{'a': 2, 'plan': 82, 'man': 137, 'panama': 2823, 'canal': 9037}


We can use the `texts_to_sequences` method to see how the tokenizer converts text into its token numbers.

In [9]:
tokenizer.texts_to_sequences(['a', 'man', 'a', 'plan', 'a', 'canal', 'panama'])

[[2], [137], [2], [82], [2], [9037], [2823]]

## Creating Sequences

Now that we have tokenized the data, turning each word into a representative number, we will create sequences of tokens from the headlines. These sequences are what we will train our deep learning model on.

For example, let's take the headline, "nvidia launches ray tracing gpus". Each word is going to be replaced by a corresponding number, for instance: nvidia - 5, launches - 22, ray - 94, tracing - 16, gpus - 102. The full sequence would be: [5, 22, 94, 16, 102]. However, it is also valuable to train on the smaller sequences within the headline, such as "nvidia launches". We will take each headline and create a set of sequences to fill our dataset. Next, let's use our tokenizer to convert our headlines to a set of sequences.

In [10]:
# Convert data to sequence of tokens
input_sequences = []
for line in all_headlines:
    # Convert our headline into a sequence of tokens
    token_list = tokenizer.texts_to_sequences([line])[0]
    
    # Create a series of sequences for each headline
    for i in range(1, len(token_list)):
        partial_sequence = token_list[:i+1]
        input_sequences.append(partial_sequence)
        
print(tokenizer.sequences_to_texts(input_sequences[:5]))
input_sequences[:5]

['i stand', 'i stand with', 'i stand with the', 'i stand with the ‘she', 'i stand with the ‘she devils’']


[[30, 314],
 [30, 314, 11],
 [30, 314, 11, 1],
 [30, 314, 11, 1, 3395],
 [30, 314, 11, 1, 3395, 5242]]
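To make the expansion concrete, the sketch below applies the same slicing to the illustrative token numbers from the "nvidia launches ray tracing gpus" example. The helper name `prefix_sequences` is hypothetical; it only repeats the inner loop from the cell above.

```python
def prefix_sequences(token_list):
    """Return every prefix of token_list that has at least two tokens."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]

# Illustrative token numbers for "nvidia launches ray tracing gpus"
headline_tokens = [5, 22, 94, 16, 102]
sequences = prefix_sequences(headline_tokens)
for sequence in sequences:
    print(sequence)
```

A headline with n tokens yields n - 1 sequences, and a one-word headline contributes none.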

## Padding sequences

Right now our sequences are of various lengths. For our model to be able to train on the data, we need to make all the sequences the same length. To do this we will add padding to the sequences. Keras has a built-in `pad_sequences` function that we can use.

In [11]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

In [12]:
# Determine max sequence length
max_sequence_len = max([len(x) for x in input_sequences])

# Pad all sequences with zeros at the beginning to make them all max length
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len,
                                         padding='pre'))
input_sequences[0]

array([  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
        30, 314], dtype=int32)
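As a rough picture of what `padding='pre'` does, here is a small pure-Python stand-in. The function `pad_pre` is a hypothetical helper written for this illustration, not part of Keras; it assumes that sequences longer than `maxlen` keep their last `maxlen` tokens.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` up to maxlen; keep the last maxlen tokens if longer."""
    padded = []
    for sequence in sequences:
        trimmed = sequence[-maxlen:]
        padded.append([value] * (maxlen - len(trimmed)) + trimmed)
    return padded

toy_sequences = [[30, 314], [30, 314, 11], [30, 314, 11, 1]]
padded = pad_pre(toy_sequences, maxlen=4)
for row in padded:
    print(row)
```

Padding at the front keeps the most recent word in the final column of every row, which is convenient when that last word is used as the label.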

## Creating predictors and Target

We also want to split up our sequences into predictors and a target. The last word of the sequence will be our target, and all of the words before it will be our predictors.

In [13]:
# Predictors are every word except the last
predictors = input_sequences[:, :-1]
# Labels are the last word
labels = input_sequences[:, -1]
labels[:5]

array([ 314,   11,    1, 3395, 5242], dtype=int32)

We are predicting one word out of our possible total vocabulary. Instead of the network predicting scalar numbers, we will have it predict binary categories.

In [14]:
from tensorflow.keras import utils

In [15]:
labels = utils.to_categorical(labels, num_classes=total_words)
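The idea behind this conversion can be sketched in a few lines of plain Python. The helper `one_hot` and the toy values are assumptions made for illustration; the real call above produces vectors of length `total_words`.

```python
def one_hot(label, num_classes):
    """Return a list of zeros with a single 1.0 at position `label`."""
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector

toy_labels = [3, 1, 0]
toy_num_classes = 5
encoded = [one_hot(label, toy_num_classes) for label in toy_labels]
for row in encoded:
    print(row)
```

Each row has exactly one non-zero entry, so the network's output can be compared against it position by position.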

## Creating the model

For our model, we are going to use a couple of new layers to deal with our sequential data.

**Embedding Layer**

Our first layer is an embedding layer:
```python
model.add(Embedding(input_dimension, output_dimension, input_length = input_len))
```
This layer will take the tokenized sequences and will learn an embedding for all of the words in the training dataset. Mathematically, embeddings work the same way as a neuron in a neural network, but conceptually their goal is to reduce the number of dimensions for some or all of the features. In this case, it will represent each word as a vector, and the information within that vector will contain the relationships between each word.
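One way to picture an embedding is as a table with one row of weights per token number, where looking up a token returns its row. The sketch below uses toy sizes and random starting values; `embedding_table` and `embed` are hypothetical names, and in the real layer the values are adjusted during training.

```python
import random

random.seed(0)
vocab_size = 6      # toy stand-in for the vocabulary size
embedding_dim = 3   # toy stand-in for the length of each word vector

# One row of weights per token number
embedding_table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                   for _ in range(vocab_size)]

def embed(token_ids):
    """Replace each token number with its row from the embedding table."""
    return [embedding_table[token_id] for token_id in token_ids]

vectors = embed([0, 0, 4, 2])
print(len(vectors), len(vectors[0]))
```

A padded sequence of 4 tokens becomes 4 vectors of length 3, and repeated tokens (such as the padding value 0) map to the same vector.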

**Long Short Term Memory Layer**

Our next layer is a long short term memory layer (LSTM). An LSTM is a type of recurrent neural network, or RNN. Unlike traditional feed-forward networks, recurrent networks have loops in them, allowing information to persist.

New information (x) gets passed in to the network, which spits out a prediction (h). Additionally, information from that step gets saved and used as input for the next prediction. In other words, each step receives a new piece of data and also gets to learn from the step before it.

Traditional RNNs suffer from the issue of more recent information contributing more than information from further back. LSTMs are a special type of recurrent layer that are able to learn and retain longer term information.
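The loop can be sketched with a single-number recurrent cell. This is a plain RNN step, not an LSTM, and the weights `w_x`, `w_h`, and `b` are arbitrary values chosen for illustration. Feeding in one non-zero input followed by zeros shows how the influence of an early input shrinks at every step.

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One step of a plain recurrent cell: mix the new input with the previous state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

inputs = [1.0, 0.0, 0.0, 0.0]
h = 0.0          # the state carried from one step to the next
states = []
for x in inputs:
    h = rnn_step(x, h)
    states.append(h)

print([round(s, 3) for s in states])
```

The state after each step is smaller than the one before, which is the fading-memory behavior that LSTM layers are designed to reduce.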

In [16]:
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.models import Sequential

In [17]:
# Input is max sequence length -1, as we have removed the last word for the label
input_len = max_sequence_len - 1

model = Sequential()

# Add input embedding layer
model.add(Embedding(total_words, 10, input_length = input_len))

# Add LSTM layer with 100 units
model.add(LSTM(100))
model.add(Dropout(0.1))

# Add output layer
model.add(Dense(total_words, activation='softmax'))

In [18]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 27, 10)            117530    
_________________________________________________________________
lstm (LSTM)                  (None, 100)               44400     
_________________________________________________________________
dropout (Dropout)            (None, 100)               0         
_________________________________________________________________
dense (Dense)                (None, 11753)             1187053   
Total params: 1,348,983
Trainable params: 1,348,983
Non-trainable params: 0
_________________________________________________________________
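The parameter counts in the summary can be checked by hand. The sketch below assumes the usual layout: one embedding row per word, four sets of LSTM weights (three gates plus the candidate cell state), and a fully connected output layer with a bias per word.

```python
vocab_size = 11753    # total_words
embedding_dim = 10
lstm_units = 100

# One vector of length embedding_dim per word
embedding_params = vocab_size * embedding_dim

# Four weight sets, each with input weights, recurrent weights, and one bias per unit
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# One weight per (unit, word) pair plus one bias per word
dense_params = lstm_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Most of the parameters sit in the output layer, because it needs a full set of weights for every word in the vocabulary.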


## Compiling the Model

We compile our model with categorical crossentropy (as before), as we are categorically predicting one word from our total vocabulary. In this case, we are not going to use accuracy as a metric, because text prediction is not measured as being more or less accurate in the same way as image classification.

We are also going to select a particular optimizer that is well suited for LSTM tasks, called the *Adam* optimizer. The important point is that different optimizers can be better suited to different deep learning tasks.
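For a single example, categorical crossentropy reduces to the negative log of the probability the model assigned to the correct word. The following sketch uses made-up probabilities over a four-word vocabulary; the function is written here for illustration and is not the Keras implementation.

```python
import math

def categorical_crossentropy(one_hot_target, predicted_probs):
    """Loss for one example: -sum(target * log(prob))."""
    return -sum(t * math.log(p)
                for t, p in zip(one_hot_target, predicted_probs) if t > 0)

target = [0.0, 0.0, 1.0, 0.0]
confident = [0.05, 0.05, 0.85, 0.05]
unsure = [0.25, 0.25, 0.25, 0.25]

loss_confident = categorical_crossentropy(target, confident)
loss_unsure = categorical_crossentropy(target, unsure)

# Loss of a model that spreads its guess evenly over 11753 words
baseline = math.log(11753)
print(round(loss_confident, 3), round(loss_unsure, 3), round(baseline, 2))
```

The baseline of roughly 9.37 is a useful reference point: a training loss below it means the model is doing better than guessing uniformly over the vocabulary.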

In [19]:
model.compile(loss='categorical_crossentropy', optimizer='adam')

## Training the Model

An important thing to notice is that we don't have a training or validation accuracy score in this case. This reflects our different problem of text prediction.

In [20]:
model.fit(predictors, labels, epochs=30, verbose=1)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<tensorflow.python.keras.callbacks.History at 0x7f56fef507c0>

## Results

We can see that the loss decreased over the course of training. We could train our model further to decrease the loss, but that would take some time, and we are not looking for a perfect text predictor.

## Making predictions

In order to make predictions, we will need to start with a seed text and prepare it in the same way we prepared our dataset. This will mean tokenizing and padding. We will create a function to be able to make a prediction.

In [21]:
def predict_next_token(seed_text):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
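    # Pick the most probable token (`predict_classes` is not available in newer TensorFlow releases)
    prediction = np.argmax(model.predict(token_list, verbose=0), axis=-1)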
    return prediction

In [23]:
prediction = predict_next_token('today in new york')
prediction

array([5])

Let's use our tokenizer to decode the predicted word.

In [24]:
tokenizer.sequences_to_texts([prediction])

['in']
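Behind this prediction, the softmax output layer produces one probability per word in the vocabulary, and the predicted token is simply the position of the largest one. A small sketch with invented probabilities over a six-word vocabulary (the `argmax` helper is written here for illustration):

```python
def argmax(values):
    """Return the position of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

# Invented probabilities, one per token number 0..5
toy_probs = [0.01, 0.10, 0.04, 0.15, 0.05, 0.65]
predicted_id = argmax(toy_probs)
print(predicted_id)
```

Always taking the most probable word is a simple strategy, and it is one reason the model may repeat common words.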

## Generate new headlines

Now that we are able to predict new words, let's create a function that can predict headlines of more than just one word.

In [25]:
# This function creates a new headline of arbitrary length
def generate_headlines(seed_text, next_words=1):
    for _ in range(next_words):
        # Predict next token
        prediction = predict_next_token(seed_text)
        
        # Convert token to word
        next_word = tokenizer.sequences_to_texts([prediction])[0]
        
        # Add next word to the headline. This headline will be used in the next pass of the loop.
        seed_text += " " + next_word
        
    # Return headline as title_case
    return seed_text.title()
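The important part of this function is the feedback loop: every predicted word is appended to the seed and becomes part of the input for the next prediction. The sketch below shows that loop in isolation, with a hypothetical lookup table (`toy_next_word`) standing in for the trained model, so it runs without TensorFlow.

```python
# Hypothetical stand-in for the trained model: maps the last word to a next word
toy_next_word = {'is': 'the', 'the': 'new', 'new': 'york', 'york': 'times'}

def toy_predict(seed_text):
    """Look at the last word only; the real model sees the whole padded sequence."""
    last_word = seed_text.split()[-1]
    return toy_next_word.get(last_word, 'the')

def toy_generate(seed_text, next_words=1):
    for _ in range(next_words):
        # The growing headline is the input for the next pass of the loop
        seed_text += ' ' + toy_predict(seed_text)
    return seed_text.title()

headline = toy_generate('washington dc is', next_words=4)
print(headline)
```

Note that `title()` capitalizes only the first letter of each word, which is why an abbreviation such as "dc" comes out as "Dc".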

In [26]:
seed_texts = ['washington dc is',
             'today in new york',
             'the school district has',
             'crime has become']
for seed in seed_texts:
    print(generate_headlines(seed, next_words=5))

Washington Dc Is The Trump Right Not Too
Today In New York In New ‘America Kong Brooklyn’
The School District Has The New York Times Affects
Crime Has Become A Brick Wall On Top


After 30 epochs of training, we can notice that most of the headlines make some kind of grammatical sense, but don't necessarily indicate a good contextual understanding. The results might improve somewhat by running more epochs. We can do this by running the training `fit` cell again (and again!) to train another 30 epochs each time. We should see the loss value go down. After that, try the tests again; results can vary quite a bit!

Another improvement would be to try using pretrained embeddings such as Word2Vec or GloVe, rather than learning them during training as we did with the Keras Embedding layer.

Ultimately, however, NLP has moved beyond simple LSTM models to Transformer-based pre-trained models, which are able to learn language context from huge amounts of textual data such as Wikipedia. These pre-trained models are then used as a starting point for transfer learning to solve NLP tasks such as the one we just tried for text completion.

We have successfully trained a model to predict words in a headline and used that model to create headlines of various lengths.

In [27]:
# Clear the GPU memory
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}