# Me, messing around

*Julia Piaskowski*    
*2020-04-20*

## Example 1: 
Following parts of this:

https://www.kaggle.com/shivamb/beginners-guide-to-text-generation-using-lstms


This uses the NYT comments data set, which consists of several CSV files of articles and comments for selected months.  

In [6]:
# for data utility function

import pandas as pd
import numpy as np
import string, os 

import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)

In [16]:
# for nlp functions 

import torchtext
from torchtext.data import get_tokenizer

#### Load Data

In [10]:
# directory for all data
data_dir = '../data/mydata/nyt-comments/'
# this dataset can be downloaded here:
# https://www.kaggle.com/aashita/nyt-comments

# First, Get Article Headlines
all_headlines = []

for filename in os.listdir(data_dir):
    if 'Articles' in filename:
        article_df = pd.read_csv(data_dir + filename)
        all_headlines.extend(list(article_df.headline.values))
        break

all_headlines = [h for h in all_headlines if h != "Unknown"]


In [11]:
# check it worked: 
len(all_headlines)

831

#### Clean Text

This function removes punctuation, converts everything to lowercase, and drops any non-ASCII characters by encoding to UTF-8 and decoding as ASCII. It's really nice!

In [12]:
def clean_text(txt):
    txt = "".join(v for v in txt if v not in string.punctuation).lower()
    txt = txt.encode("utf8").decode("ascii",'ignore')
    return txt 

corpus = [clean_text(x) for x in all_headlines]
# check that it worked
corpus[:10]

['finding an expansive view  of a forgotten people in niger',
 'and now  the dreaded trump curse',
 'venezuelas descent into dictatorship',
 'stain permeates basketball blue blood',
 'taking things for granted',
 'the caged beast awakens',
 'an everunfolding story',
 'oreilly thrives as settlements add up',
 'mouse infestation',
 'divide in gop now threatens trump tax plan']
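To see what each step of `clean_text` does, here is a small standalone sketch. The three input strings are made up for illustration (they are not read from the data set). Note that `string.punctuation` only covers ASCII punctuation, so curly quotes and accented letters are removed by the ASCII decode step instead.

```python
import string

def clean_text(txt):
    # Drop ASCII punctuation, then lowercase
    txt = "".join(ch for ch in txt if ch not in string.punctuation).lower()
    # Encode to UTF-8 bytes, then decode as ASCII, silently skipping
    # every byte that is not valid ASCII (accents, curly quotes, ...)
    return txt.encode("utf8").decode("ascii", "ignore")

# Hypothetical inputs, chosen to exercise each step
samples = ["Venezuela's Descent", "An Ever-Unfolding Story", "Café “Society”"]
cleaned = [clean_text(s) for s in samples]
print(cleaned)
# ['venezuelas descent', 'an everunfolding story', 'caf society']
```

The side effect is visible in the last two results: hyphenated words get fused and accented letters vanish rather than being transliterated.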

In [22]:
tokenizer = get_tokenizer("basic_english")

tokens = [tokenizer(x) for x in corpus]

In [24]:
print(len(tokens))
print(tokens[0])

831
['finding', 'an', 'expansive', 'view', 'of', 'a', 'forgotten', 'people', 'in', 'niger']


That was fun, but this is a pain. The next step is to generate n-grams of each headline and pad the sequences. It would also help if we were dealing with integer indices rather than actual words at this point. 

### Now Keras: 

In [37]:
import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Dropout

from keras.callbacks import EarlyStopping
from keras.models import Sequential
import keras.utils as ku 

In [26]:
# this one is clearer, easier, faster:

tokenizer = Tokenizer()

def get_sequence_of_tokens(corpus):
    ## tokenization
    tokenizer.fit_on_texts(corpus)
    total_words = len(tokenizer.word_index) + 1
    
    ## convert data to sequence of tokens 
    input_sequences = []
    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)
    return input_sequences, total_words

inp_sequences, total_words = get_sequence_of_tokens(corpus)
inp_sequences[:10]

[[169, 17],
 [169, 17, 665],
 [169, 17, 665, 367],
 [169, 17, 665, 367, 4],
 [169, 17, 665, 367, 4, 2],
 [169, 17, 665, 367, 4, 2, 666],
 [169, 17, 665, 367, 4, 2, 666, 170],
 [169, 17, 665, 367, 4, 2, 666, 170, 5],
 [169, 17, 665, 367, 4, 2, 666, 170, 5, 667],
 [6, 80]]

In [31]:
print(len(inp_sequences))

4806


Now we have a collection of sequences: each headline is broken into prefix n-grams of length $2, 3, \ldots, k_i$ (where $k_i$ is the number of tokens in headline $i$).
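As a rough sketch of that prefix step on its own, without Keras: the helper name `prefix_sequences` and the token ids are made up for illustration.

```python
def prefix_sequences(token_ids):
    """Return every prefix of token_ids that has at least two tokens."""
    return [token_ids[:i + 1] for i in range(1, len(token_ids))]

example = [6, 80, 1, 9]  # hypothetical token ids for a 4-word headline
prefixes = prefix_sequences(example)
for p in prefixes:
    print(p)
# [6, 80]
# [6, 80, 1]
# [6, 80, 1, 9]
```

A headline with $k_i$ tokens therefore contributes $k_i - 1$ sequences, which is why the first 10-token headline produced 9 rows above.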

All these sequences need to be padded to the length of the longest sequence (the maximum of $k_i$). 
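Here is a minimal pure-Python sketch of what "pre" padding and the predictor/label split amount to. `pad_pre` is a hypothetical stand-in for `pad_sequences(..., padding='pre')`, not the Keras implementation.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad seq with `value` up to maxlen, keeping the last maxlen items."""
    seq = seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

# The first three prefix sequences from the output above
sequences = [[169, 17], [169, 17, 665], [169, 17, 665, 367]]
maxlen = max(len(s) for s in sequences)

padded = [pad_pre(s, maxlen) for s in sequences]
# Everything but the last token is the input; the last token is the target
predictors = [row[:-1] for row in padded]
labels = [row[-1] for row in padded]

print(padded)      # [[0, 0, 169, 17], [0, 169, 17, 665], [169, 17, 665, 367]]
print(predictors)  # [[0, 0, 169], [0, 169, 17], [169, 17, 665]]
print(labels)      # [17, 665, 367]
```

Padding on the left keeps the word to predict in the final column of every row, so a single slice separates inputs from targets.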

In [32]:
def generate_padded_sequences(input_sequences):
    max_sequence_len = max([len(x) for x in input_sequences])
    input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
    
    predictors, label = input_sequences[:,:-1],input_sequences[:,-1]
    label = ku.to_categorical(label, num_classes=total_words)
    return predictors, label, max_sequence_len

predictors, label, max_sequence_len = generate_padded_sequences(inp_sequences)

In [50]:
print(len(label), len(predictors))
print(max_sequence_len)
print(len(predictors[4]))
print(predictors[0])
print(predictors[1])
print(predictors[2])
print(predictors[4])


4806 4806
19
18
[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 169]
[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 169  17]
[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 169  17 665]
[  0   0   0   0   0   0   0   0   0   0   0   0   0 169  17 665 367   4]


In [60]:
print(len(label[4]))  # length is 2422 = total_words (unique words + 1, since index 0 is reserved for padding)
print(np.unique(label)) 
    # just a yes/no for the next word in the sequence from the predictor matrix
print(label[4]) # mostly zeros, of course

2422
[0. 1.]
[0. 0. 1. ... 0. 0. 0.]
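For intuition, this is roughly what the one-hot encoding of the labels looks like in plain Python. `one_hot` is a hypothetical helper, not the Keras `to_categorical` function, and the class count is shrunk to 5 so the rows are readable.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position `index`."""
    row = [0.0] * num_classes
    row[index] = 1.0
    return row

labels = [2, 0, 4]   # hypothetical next-word indices
num_classes = 5      # stands in for total_words (2422 above)
encoded = [one_hot(i, num_classes) for i in labels]

for row in encoded:
    print(row)
# [0.0, 0.0, 1.0, 0.0, 0.0]
# [0.0, 0.0, 0.0, 0.0, 1.0]  <- see note: rows follow the order of `labels`
```

The rows follow the order of `labels`, so the second printed row is actually `[1.0, 0.0, 0.0, 0.0, 0.0]` and the third is `[0.0, 0.0, 0.0, 0.0, 1.0]`; only the first two lines of each row set differ by where the single 1 sits. This also explains `label[4]` above: the fifth sequence ends in token 2, so its 1 is at position 2.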


#### Build Model

This is largely from the example, with a change to the optimizer (gradient clipping added).

The model predicts the next word in a sequence from the words that precede it.
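The gradient clipping is the `clipnorm` argument passed to the Adam optimizer in `create_model`. As I understand it, norm clipping rescales a gradient whenever its L2 norm exceeds the threshold and leaves it alone otherwise. A minimal sketch for a single gradient vector (`clip_by_norm` is a hypothetical helper, not the Keras internals):

```python
import math

def clip_by_norm(grad, clipnorm):
    """Rescale grad so its L2 norm is at most clipnorm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clipnorm or norm == 0.0:
        return list(grad)          # already small enough, leave unchanged
    scale = clipnorm / norm
    return [g * scale for g in grad]

clipped = clip_by_norm([3.0, 4.0], 1.0)  # norm 5, scaled down to norm 1
small = clip_by_norm([0.3, 0.4], 1.0)    # norm 0.5, returned unchanged
print(clipped, small)
```

The direction of the gradient is preserved; only its length is capped, which helps guard against the occasional very large update that recurrent networks can produce.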

In [39]:
def create_model(max_sequence_len, total_words):
    input_len = max_sequence_len - 1
    model = Sequential()
    
    # Add Input Embedding Layer
        # Maps each integer word index to a learned dense vector (length 10 here)
    model.add(Embedding(total_words, 10, input_length=input_len))
    
    # Add Hidden Layer 1, a LSTM Layer
    model.add(LSTM(100))
    model.add(Dropout(0.1))
    
    # Add Output Layer
    model.add(Dense(total_words, activation='softmax'))
    
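    # Adam with gradient clipping; named `optimizer` so the instance is not confused with the Adam class
    optimizer = keras.optimizers.Adam(learning_rate=0.001, clipnorm=1)

    model.compile(loss='categorical_crossentropy', optimizer=optimizer)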
    
    return model

model = create_model(max_sequence_len, total_words)
#model.summary()
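On the Embedding layer: as far as I can tell, it behaves like a lookup table with one row per word index, and the rows are adjusted during training. A toy sketch of just the lookup part follows; the sizes, the random initial values, and the `embed` helper are all assumptions for illustration.

```python
import random

random.seed(0)
vocab_size, embed_dim = 6, 3  # toy sizes; the model uses total_words and 10

# One row of small random numbers per word index (made-up initialization)
table = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
         for _ in range(vocab_size)]

def embed(token_ids):
    """Replace each integer index with its row from the table."""
    return [table[i] for i in token_ids]

padded = [0, 0, 4, 2]     # a left-padded sequence of hypothetical indices
vectors = embed(padded)
print(len(vectors), len(vectors[0]))  # 4 3
```

So a padded row of 18 integers becomes an 18 x 10 array before it reaches the LSTM, and identical indices (such as the padding zeros) always map to the same vector.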

In [61]:
model.fit(predictors, label, epochs=100, verbose=0)
# default batch size is 32
# verbose=0 is for silent
# shuffles by default
# loads of other parameters: https://keras.io/models/model/#fit


<keras.callbacks.callbacks.History at 0x153f19f28>

I should probably check how to assess model accuracy.
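One possible way to do that, sketched without Keras: hold out some sequences, get the predicted probability row for each, and compute next-word accuracy and mean cross-entropy. The `evaluate` helper and the numbers are made up for illustration. In Keras itself, passing `metrics=['accuracy']` to `model.compile` and `validation_split` to `model.fit` should report similar quantities, though I have not tried that here.

```python
import math

def evaluate(prob_rows, true_ids):
    """Return (accuracy, mean cross-entropy) for next-word predictions."""
    correct = 0
    total_loss = 0.0
    for probs, true_id in zip(prob_rows, true_ids):
        # Greedy prediction: index of the highest probability
        predicted = max(range(len(probs)), key=probs.__getitem__)
        correct += predicted == true_id
        # Cross-entropy for one example; the floor avoids log(0)
        total_loss += -math.log(max(probs[true_id], 1e-12))
    n = len(true_ids)
    return correct / n, total_loss / n

# Hypothetical softmax outputs over a 3-word vocabulary
prob_rows = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.5, 0.4, 0.1]]
true_ids = [0, 2, 1]
accuracy, loss = evaluate(prob_rows, true_ids)
print(accuracy, loss)
```

Here two of the three predictions are right, so the accuracy is 2/3. With a vocabulary of 2422 words, exact-match accuracy will likely be low even for a reasonable model, so the held-out loss may be the more informative number.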

#### Text Generation 

(the good stuff)

In [62]:
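def generate_text(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        # encode the text so far and left-pad it to the model's input length
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        # index of the most probable next word
        # (model.predict_classes did this but was removed in later Keras releases)
        predicted = int(np.argmax(model.predict(token_list, verbose=0), axis=-1)[0])
        
        # reverse lookup: find the word that has this index
        output_word = ""
        for word,index in tokenizer.word_index.items():
            if index == predicted:
                output_word = word
                break
        seed_text += " "+output_word
    return seed_text.title()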

In [68]:
print(generate_text("idaho", 5, model, max_sequence_len))

Idaho Is There Really A Small


In [72]:
print(generate_text("idaho",15, model, max_sequence_len))

Idaho Is There Really A Small Army On The Bahamas Shelter Swamp Many Readers Heads Heads


In [67]:
print(generate_text("india", 5, model, max_sequence_len))

India The Dreaded Trump Curse Smart


In [65]:
print(generate_text("wall street", 5, model, max_sequence_len))

Wall Street Subjects A Vacation Is Could
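A note on why the 5-word and 15-word "idaho" outputs start identically: `generate_text` always takes the single most probable next word, so the loop is deterministic. The sketch below mimics that loop with a tiny made-up score table in place of the trained model; `toy_scores` and `generate_greedy` are hypothetical and only look at the last word, unlike the LSTM.

```python
# Hypothetical stand-in for the model: last word -> scores for the next word
toy_scores = {
    "idaho": {"is": 0.6, "the": 0.4},
    "is": {"there": 0.7, "a": 0.3},
    "there": {"really": 0.9, "is": 0.1},
}

def generate_greedy(seed_text, next_words):
    words = seed_text.lower().split()
    for _ in range(next_words):
        scores = toy_scores.get(words[-1])
        if not scores:
            break  # the toy table has no entry for this word
        # Greedy choice: always append the highest-scoring word
        words.append(max(scores, key=scores.get))
    return " ".join(words).title()

result = generate_greedy("idaho", 3)
print(result)  # Idaho Is There Really
```

Because nothing is sampled, asking for more words only extends the same sequence, and once the model falls into a loop it tends to repeat itself (as with "Heads Heads" above).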
