
# Reading the Data
We'll start by reading in all the headlines from the articles. The articles are stored in CSV format, so we use *pandas* to read them in.

In [26]:
import os
import pandas as pd
nyt_dir = 'data/nyt_dataset/articles/'

all_headlines = []
for filename in os.listdir(nyt_dir):
    if 'Articles' in filename:
        # Read in all the data from csv
        headlines_df = pd.read_csv(os.path.join(nyt_dir, filename))
        # Add each headline to our flat list (extend, unlike append, does not nest the new list)
        all_headlines.extend(list(headlines_df.headline.values))
len(all_headlines)

9335

In [27]:
all_headlines[20:40]

['Initial Description',
 'Rough Estimates',
 'El Pasatiempo Nacional',
 'Cooling Off on a Hot Day at Yankee Stadium',
 'Trump’s Staff Mixed Politics and Paydays',
 'A Virtuoso Rebuilding Act Requires Everyone in Tune',
 '‘Homeland,’ Season 6, Episode 11: Is Quinn Just a Natural Killer?',
 '‘Big Little Lies’ and the Art of Empathy',
 'Upending a Whodunit',
 '‘Feud: Bette and Joan’ Episode 5: Taking the Stage',
 '‘Billions’ Season 2, Episode 7: Greed Is Good. Except When It’s Not.',
 'Unknown',
 'What’s Going On in This Picture? | April 3, 2017',
 'Unknown',
 'Have You Ever Felt Pressured by Family or Others in Making an Important Decision About Your Future?',
 'Unknown',
 'A Cornerstone of Peace at Risk',
 'Trump Is  Wimping Out on Trade',
 'The Dwindling Odds of Coincidence',
 'What Was Lenin Thinking?']

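# Cleaning the Data
Some headlines are listed as `'Unknown'` (several appear in the sample above). These carry no useful text, so we filter them out.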

In [28]:
all_headlines = [h for h in all_headlines if h != 'Unknown']  # keep only real headlines
len(all_headlines)

8603

We also want to remove punctuation and convert our sentences to lower case. For our purposes, it makes little or no difference whether a line ends with "!" or "?", or whether a word is capitalized, as in "The", or lower case, as in "the". With fewer unique tokens, our model will be easier to train.
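
To make the effect on vocabulary size concrete, here is a minimal standard-library sketch of this kind of normalization. It is not the Keras implementation: the `normalize` helper and the two example lines are illustrative assumptions, and the `FILTERS` string is copied from the default `filters` argument of the `Tokenizer` signature.

```python
# Characters to strip, copied from the Tokenizer's default `filters` argument
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def normalize(text):
    """Lowercase `text`, replace filtered characters with spaces, and split into words."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

# Two made-up lines that differ only in case and final punctuation
raw = ["What Was Lenin Thinking?", "what was LENIN thinking!"]

raw_vocab = {word for line in raw for word in line.split()}
clean_vocab = {word for line in raw for word in normalize(line)}

print(len(raw_vocab), "unique tokens before normalization")
print(len(clean_vocab), "unique tokens after normalization")
```

Without normalization every word in the two lines counts as a distinct token. After normalization both lines reduce to the same four tokens.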
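# Tokenization
The Keras `Tokenizer` applies both steps by default through its `filters` and `lower` arguments, as its signature shows: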
```python
tensorflow.keras.preprocessing.text.Tokenizer(
    num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True,
    split=' ', char_level=False, oov_token=None, document_count=0, **kwargs
)
```

In [29]:
from tensorflow.keras.preprocessing.text import Tokenizer

# Tokenize the words in our headlines
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_headlines)
total_words = len(tokenizer.word_index) + 1
print('Total words: ', total_words)

Total words:  11753


We can take a quick look at the `word_index` dictionary to see how the tokenizer stores the words.

In [30]:
# Print a subset of the word_index dictionary created by Tokenizer
subset_dict = {key: value for key, value in tokenizer.word_index.items()
               if key in ['a', 'man', 'a', 'plan', 'a', 'canal', 'panama']}
print(subset_dict)

{'a': 2, 'plan': 82, 'man': 138, 'panama': 3379, 'canal': 7144}


In [31]:
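# Index 1 is the most frequent word (this result is not displayed,
# because a notebook only echoes the last expression in a cell)
tokenizer.sequences_to_texts([[1]])
# Each list element is treated as a separate text
tokenizer.texts_to_sequences(['a', 'man', 'a', 'plan', 'a', 'canal', 'panama'])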

[[2], [138], [2], [82], [2], [7144], [3379]]
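
The indices above reflect word frequency: lower numbers go to more common words, and index 0 is reserved, which is why `total_words` is `len(tokenizer.word_index) + 1`. The sketch below imitates that ranking with the standard library. `build_word_index` and the toy headlines are hypothetical, and it is a simplified stand-in rather than the Keras code.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to a 1-based rank, with the most frequent word first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

toy_headlines = ["a man a plan", "a canal", "a plan for panama"]
toy_index = build_word_index(toy_headlines)
print(toy_index)

# Index 0 is never assigned to a word, so the vocabulary size is len + 1
toy_total_words = len(toy_index) + 1
print("Total words:", toy_total_words)
```

Here "a" appears most often and receives index 1, followed by "plan" with index 2.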

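# Creating Sequences
To train a next-word predictor, we break each headline into progressively longer partial sequences, so that every partial sequence ends with a word the model should learn to predict.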


In [32]:
# Convert data to sequence of tokens
input_sequences = []
for line in all_headlines:
    # Convert our headline into a sequence of tokens
    token_list = tokenizer.texts_to_sequences([line])[0]

    # Create a series of sequences for each headline
    for i in range(1,len(token_list)):
        partial_sequence = token_list[:i+1]
        input_sequences.append(partial_sequence)
print(tokenizer.sequences_to_texts(input_sequences[:9]))
input_sequences[:9]

['finding an', 'finding an expansive', 'finding an expansive view', 'finding an expansive view of', 'finding an expansive view of a', 'finding an expansive view of a forgotten', 'finding an expansive view of a forgotten people', 'finding an expansive view of a forgotten people in', 'finding an expansive view of a forgotten people in niger']


[[403, 17],
 [403, 17, 5242],
 [403, 17, 5242, 543],
 [403, 17, 5242, 543, 4],
 [403, 17, 5242, 543, 4, 2],
 [403, 17, 5242, 543, 4, 2, 1616],
 [403, 17, 5242, 543, 4, 2, 1616, 151],
 [403, 17, 5242, 543, 4, 2, 1616, 151, 5],
 [403, 17, 5242, 543, 4, 2, 1616, 151, 5, 1992]]
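
Each partial sequence can be read as a training example: everything but the last token is the context, and the last token is the word to predict. A minimal sketch of the same prefix logic, using a hypothetical `prefix_sequences` helper and the first four token ids from the output above:

```python
def prefix_sequences(token_list):
    """Return every prefix of token_list that contains at least two tokens."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]

tokens = [403, 17, 5242, 543]  # 'finding an expansive view'
for seq in prefix_sequences(tokens):
    *context, target = seq
    print(context, "->", target)
```

A headline with `n` tokens therefore yields `n - 1` training examples, and a one-word headline yields none.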

# Padding Sequences

In [33]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Determine max sequence length
max_sequence_length = max([len(x) for x in input_sequences])

# Pad all sequences with zeros at the beginning to make them all max length
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_length, padding='pre'))
input_sequences[0]

array([  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
       403,  17])

In [34]:
tokenizer.sequences_to_texts([[403,17]])

['finding an']
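
Padding at the front keeps the real tokens at the end of every row, so the target word is always in the last column. The following pure-Python sketch mirrors what `pad_sequences(..., padding='pre')` does to lists of token ids. `pad_pre` is a hypothetical helper, not the Keras function, and it assumes over-long sequences are trimmed from the front.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value`, keeping at most the last `maxlen` tokens."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # drop tokens from the front if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

for row in pad_pre([[403, 17], [403, 17, 5242]], maxlen=4):
    print(row)
```

Because index 0 is never assigned to a word, the padding value cannot be confused with a real token.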

# Creating Predictors and Target
We also want to split our sequences into predictors and a target. The last word of each sequence will be our target, and all the words before it will be our predictors.

In [35]:
# Slicing a 2D array:
#   input_sequences[-1]    -> the last row
#   input_sequences[:, -1] -> the last column (":" keeps every row, "-1" picks the last column)
# Predictors are every word except the last
predictors = input_sequences[:,:-1]
# Labels are the last word
labels = input_sequences[:,-1]
labels[:5]

array([  17, 5242,  543,    4,    2])

As in our earlier sections, these targets are categorical: we are predicting one word out of our total vocabulary. Instead of having the network predict a scalar, we one-hot encode each label so that the network predicts one binary category per word in the vocabulary.

In [36]:
from tensorflow.keras import utils

labels = utils.to_categorical(labels, num_classes=total_words)
labels

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
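
One-hot encoding turns each integer label into a row of zeros with a single 1 at the label's index. A small pure-Python sketch of the idea follows. The `one_hot` helper and the toy labels are illustrative assumptions, not the `to_categorical` implementation.

```python
def one_hot(labels, num_classes):
    """Return one row per label: all zeros except a 1.0 at the label's index."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

encoded = one_hot([2, 0, 3], num_classes=5)
for row in encoded:
    print(row)
```

The array printed above appears to contain only zeros because NumPy abbreviates large arrays and shows just the first and last few columns. With 11,753 columns, the single 1 in each row usually falls in the hidden part.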

# Creating the Model
For our model, we're going to use a couple of new layers to deal with our sequential data: an `Embedding` layer and an `LSTM` layer.


# Understanding the Model
A walkthrough of the model can be found in [headline_gen_model_understanding.ipynb](https://github.com/GalaxUniv/Learning-Data/tree/main/Nvidia/Fundamentals%20of%20Deep%20Learning/06_headline_generator/headline_gen_model_understanding.ipynb).

In [37]:
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.models import Sequential

input_len = max_sequence_length - 1

model = Sequential()

# Add input embedding layer
model.add(Embedding(total_words, 10, input_length=input_len))

# Add LSTM layer with 100 units
model.add(LSTM(100))
model.add(Dropout(0.1))

# Add model output
model.add(Dense(total_words, activation='softmax'))

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 27, 10)            117530    
                                                                 
 lstm (LSTM)                 (None, 100)               44400     
                                                                 
 dropout (Dropout)           (None, 100)               0         
                                                                 
 dense (Dense)               (None, 11753)             1187053   
                                                                 
Total params: 1,348,983
Trainable params: 1,348,983
Non-trainable params: 0
_________________________________________________________________
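
The parameter counts in the summary can be reproduced by hand, which is a useful check on what each layer stores. The arithmetic below uses the standard formulas for an embedding table, an LSTM with four gates, and a dense layer with biases. The variable names are chosen for illustration.

```python
vocab_size = 11753     # total_words
embedding_dim = 10     # size of each word vector
lstm_units = 100

# Embedding: one vector of length embedding_dim per vocabulary entry
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias per unit
lstm_params = 4 * (embedding_dim + lstm_units + 1) * lstm_units

# Dense: one weight per LSTM unit plus a bias, for every output word
dense_params = (lstm_units + 1) * vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Most of the parameters sit in the final `Dense` layer, because it needs one output per vocabulary word.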


# Compiling Model
We compile the model with categorical cross-entropy loss, which matches our one-hot labels, and with the Adam optimizer, which is well suited to LSTM tasks.

In [39]:
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Training the Model
We will train the model for 30 epochs, which will take a few minutes.

In [41]:
model.fit(predictors, labels, epochs=30, verbose=1)

Epoch 1/30
...
Epoch 30/30


<keras.callbacks.History at 0x13f5cb3cfd0>

# Discussion of Results
We can see that the loss decreased over the course of training. We could train our model further to decrease the loss, but that would take some time.

# Making predictions

In [53]:
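def predict_next_token(seed_text):
    # Convert the seed text to token indices
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    # Pre-pad to the input length the model was trained on
    token_list = pad_sequences([token_list], maxlen=max_sequence_length - 1, padding='pre')
    # The index with the highest predicted probability is the next token
    prediction = [np.argmax(model.predict(token_list, verbose=0))]
    return prediction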
pred = predict_next_token('today in new york')
pred

[7107]

In [54]:
tokenizer.sequences_to_texts([pred])

['subway’s']
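
The prediction step always picks the single most probable word, a strategy often called greedy decoding. The sketch below shows that selection on a made-up probability vector. The `argmax` helper, `toy_probs`, and `toy_index_word` are hypothetical stand-ins for the model output and the tokenizer's index-to-word lookup.

```python
def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

# Made-up softmax output over a 4-entry vocabulary (index 0 is padding)
toy_probs = [0.05, 0.10, 0.60, 0.25]
toy_index_word = {1: "the", 2: "subway", 3: "city"}

best = argmax(toy_probs)
print(best, toy_index_word[best])
```

Because the choice is deterministic, the same seed text always produces the same next word.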

# Generate New Headlines

In [55]:
def generate_new_headlines(seed_text, next_words=1):
    for _ in range(next_words):
        # Predict next token
        prediction = predict_next_token(seed_text)
        # Convert token to word
        next_word = tokenizer.sequences_to_texts([prediction])[0]
        # Add next word to the seed_text
        seed_text += " " + next_word
    # title() capitalizes every word, including letters that follow an apostrophe
    return seed_text.title()

Now let's try it:

In [60]:
seed_text = [
    'washington dc is',
    'today in new york',
    'the school district has',
    'crime has become'
]
for seed in seed_text:
    print(generate_new_headlines(seed, next_words=6))

Washington Dc Is Moscow For A Hike On Wall
Today In New York Subway’S The Cadaver A 1 P
The School District Has The Odd It Just All It
Crime Has Become A Singular Task Of The Newspaper
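
The generation loop feeds each predicted word back in as part of the next input. The sketch below exercises that loop without a trained model. `generate`, `toy_transitions`, and `toy_predict` are hypothetical stand-ins, with a lookup table playing the role of the network.

```python
import string

def generate(seed_text, predict_next_word, next_words=1):
    """Append `next_words` predicted words to seed_text, one at a time."""
    for _ in range(next_words):
        seed_text += " " + predict_next_word(seed_text)
    return seed_text

# Hypothetical stand-in for the model: the next word depends only on the last word
toy_transitions = {"new": "york", "york": "subway’s", "subway’s": "delays"}

def toy_predict(text):
    return toy_transitions.get(text.split()[-1], "today")

result = generate("today in new", toy_predict, next_words=3)
print(result.title())            # title() also capitalizes the letter after ’
print(string.capwords(result))   # capwords() capitalizes only the first letter of each word
```

This also explains the "Subway’S" in the output above: `str.title()` treats the apostrophe as a word boundary. `string.capwords` is one way to avoid that.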


# Summary
The results are a bit underwhelming after 30 epochs of training. Most of the generated headlines make some kind of grammatical sense, but they don't show much contextual understanding. Training for more epochs might improve the results somewhat.

### Other improvements
We could also try pretrained embeddings such as Word2Vec or GloVe, rather than learning the embeddings during training as we did with the Keras `Embedding` layer.

### Ultimately
NLP has moved beyond simple LSTM models to Transformer-based pre-trained models, which learn language from context using huge amounts of text, such as Wikipedia. These pre-trained models are then used as a starting point for transfer learning to solve NLP tasks such as the text completion we just tried.