# Me, messing around

*Julia Piaskowski*    
*2020-04-20*

## Example 1: 
This follows parts of this Kaggle notebook:

https://www.kaggle.com/shivamb/beginners-guide-to-text-generation-using-lstms


This uses the NYT comments data set, which consists of several CSV files of articles and comments, split by month.  

In [4]:
# for data utility function

import pandas as pd
import numpy as np
import string, os 

import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)

In [5]:
# for nlp functions 

import torchtext
from torchtext.data import get_tokenizer

#### Load Data

In [6]:
# directory for all data
data_dir = '../data/mydata/nyt-comments/'
# this dataset can be downloaded here:
# https://www.kaggle.com/aashita/nyt-comments

# First, Get Article Headlines
all_headlines = []

for filename in os.listdir(data_dir):
    if 'Articles' in filename:
        article_df = pd.read_csv(os.path.join(data_dir, filename))
        all_headlines.extend(list(article_df.headline.values))
        # stop after the first Articles file, so only that file's headlines are loaded
        break

all_headlines = [h for h in all_headlines if h != "Unknown"]


In [7]:
# check it worked: 
len(all_headlines)

831

#### Clean Text

This function removes punctuation, converts everything to lowercase, and drops any characters that cannot be represented in ASCII. It's really nice!

In [8]:
def clean_text(txt):
    txt = "".join(v for v in txt if v not in string.punctuation).lower()
    txt = txt.encode("utf8").decode("ascii",'ignore')
    return txt 

corpus = [clean_text(x) for x in all_headlines]
# check that it worked
corpus[:10]

['finding an expansive view  of a forgotten people in niger',
 'and now  the dreaded trump curse',
 'venezuelas descent into dictatorship',
 'stain permeates basketball blue blood',
 'taking things for granted',
 'the caged beast awakens',
 'an everunfolding story',
 'oreilly thrives as settlements add up',
 'mouse infestation',
 'divide in gop now threatens trump tax plan']
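To see why the output contains words like `venezuelas` and `everunfolding`, here is a small standalone sketch of the same cleaning steps applied to a made-up headline (the sample string is an assumption, not from the data set). Apostrophes and hyphens are simply deleted, and non-ASCII characters such as curly quotes or accented letters are dropped rather than transliterated.

```python
import string

def clean_text(txt):
    # drop ASCII punctuation, then lowercase
    txt = "".join(ch for ch in txt if ch not in string.punctuation).lower()
    # encode to UTF-8 bytes, then decode as ASCII, silently skipping bytes that do not fit
    return txt.encode("utf8").decode("ascii", "ignore")

# made-up headline with curly apostrophes (U+2019) and an accented letter (U+00E9)
sample = "O\u2019Reilly\u2019s Caf\u00e9, Re-Opened!"
cleaned = clean_text(sample)
print(cleaned)  # oreillys caf reopened
```

Note that "Café" loses its last letter entirely, so this is a lossy step for any headline with accented characters.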

In [9]:
tokenizer = get_tokenizer("basic_english")

tokens = [tokenizer(x) for x in corpus]

In [10]:
print(len(tokens))
print(tokens[0])

831
['finding', 'an', 'expansive', 'view', 'of', 'a', 'forgotten', 'people', 'in', 'niger']


That was fun, but this is a pain. The next step is to generate n-grams of each headline and pad the sequences. It would also help if we were dealing with integer token indices rather than actual words at this point. 

### Now Keras: 

In [11]:
import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Dropout

from keras.callbacks import EarlyStopping
from keras.models import Sequential
import keras.utils as ku 

Using TensorFlow backend.


In [12]:
# this one is clearer, easier, faster:

tokenizer = Tokenizer()

def get_sequence_of_tokens(corpus):
    ## tokenization
    tokenizer.fit_on_texts(corpus)
    total_words = len(tokenizer.word_index) + 1
    
    ## convert data to sequence of tokens 
    input_sequences = []
    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)
    return input_sequences, total_words

inp_sequences, total_words = get_sequence_of_tokens(corpus)
inp_sequences[:10]

[[169, 17],
 [169, 17, 665],
 [169, 17, 665, 367],
 [169, 17, 665, 367, 4],
 [169, 17, 665, 367, 4, 2],
 [169, 17, 665, 367, 4, 2, 666],
 [169, 17, 665, 367, 4, 2, 666, 170],
 [169, 17, 665, 367, 4, 2, 666, 170, 5],
 [169, 17, 665, 367, 4, 2, 666, 170, 5, 667],
 [6, 80]]

In [13]:
print(len(inp_sequences))

4806


Now we have a collection of sequences: each headline is broken into prefix n-grams of length $2, 3, \ldots, k_i$ (where $k_i$ is the number of tokens in headline $i$).

All these sequences need to be padded to the length of the longest sequence (the maximum of $k_i$). 
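As a rough illustration of the prefix and padding steps without Keras, here is a NumPy-only sketch on made-up token ids. The helper names `prefix_ngrams` and `pre_pad` are hypothetical; `pre_pad` only mimics what `pad_sequences(..., padding='pre')` does for sequences that are no longer than the maximum length.

```python
import numpy as np

def prefix_ngrams(token_ids):
    # every prefix of length 2..k, mirroring the loop in get_sequence_of_tokens
    return [token_ids[:i + 1] for i in range(1, len(token_ids))]

def pre_pad(sequences, pad_value=0):
    # left-pad every sequence with pad_value up to the longest length
    max_len = max(len(s) for s in sequences)
    padded = np.full((len(sequences), max_len), pad_value, dtype=int)
    for row, seq in enumerate(sequences):
        padded[row, max_len - len(seq):] = seq
    return padded

toy_headlines = [[5, 3, 8, 2], [7, 1]]  # made-up token ids for two "headlines"
sequences = [s for h in toy_headlines for s in prefix_ngrams(h)]
padded = pre_pad(sequences)

# the last column is the word to predict, everything before it is the input
predictors_toy, labels_toy = padded[:, :-1], padded[:, -1]
print(padded)
print(labels_toy)
```

A headline with $k_i$ tokens contributes $k_i - 1$ rows, which is why 831 headlines turn into a few thousand training sequences.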

In [14]:
def generate_padded_sequences(input_sequences):
    max_sequence_len = max([len(x) for x in input_sequences])
    input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
    
    predictors, label = input_sequences[:,:-1],input_sequences[:,-1]
    label = ku.to_categorical(label, num_classes=total_words)
    return predictors, label, max_sequence_len

predictors, label, max_sequence_len = generate_padded_sequences(inp_sequences)

In [15]:
print(len(label), len(predictors))
print(max_sequence_len)
print(len(predictors[4]))
print(predictors[0])
print(predictors[1])
print(predictors[2])
print(predictors[4])


4806 4806
19
18
[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 169]
[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 169  17]
[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 169  17 665]
[  0   0   0   0   0   0   0   0   0   0   0   0   0 169  17 665 367   4]


In [16]:
print(len(label[4]))  # length is consistently 2422: the number of unique words plus 1 (index 0 is reserved for padding)
print(np.unique(label)) 
    # just a yes/no for the next word in the sequence from the predictor matrix
print(label[4]) # mostly zeros, of course

2422
[0. 1.]
[0. 0. 1. ... 0. 0. 0.]
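The one-hot labels above can be reproduced in plain NumPy. This is a minimal sketch on toy values; `one_hot` is a hypothetical helper, not the Keras function, though it should give the same kind of array as `to_categorical`.

```python
import numpy as np

def one_hot(indices, num_classes):
    # one row per label, with a single 1.0 in the column given by the label
    out = np.zeros((len(indices), num_classes), dtype="float32")
    out[np.arange(len(indices)), indices] = 1.0
    return out

toy_labels = np.array([3, 0, 2])
encoded = one_hot(toy_labels, num_classes=4)
print(encoded)

# argmax recovers the original integer labels
recovered = np.argmax(encoded, axis=1)
```

With 4806 rows and 2422 columns the real label matrix has roughly 11.6 million entries, almost all zero. If that ever becomes a memory problem, Keras also offers a `sparse_categorical_crossentropy` loss that takes the integer labels directly; that was not tried here.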


#### Build Model

This is largely from the example, with a change to the optimizer (gradient clipping added).

The model predicts the next word in a sequence from the words that precede it.
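For reference, the gradient clipping mentioned above amounts to rescaling a gradient whenever its L2 norm exceeds a threshold. Below is a minimal NumPy sketch for a single gradient array; `clip_by_norm` is a hypothetical helper, and whether Keras applies the norm per tensor or across all gradients depends on the version, so treat this as the idea rather than the exact implementation.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    # rescale grad so its L2 norm is at most max_norm; direction is unchanged
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

big = np.array([3.0, 4.0])     # norm 5.0, gets scaled down
small = np.array([0.3, 0.4])   # norm 0.5, left alone

clipped_big = clip_by_norm(big, max_norm=1.0)
clipped_small = clip_by_norm(small, max_norm=1.0)
print(clipped_big, clipped_small)
```

The point of `clipnorm = 1` is to keep one unusually large gradient step from destabilising training, which recurrent layers like LSTMs are prone to.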

In [17]:
def create_model(max_sequence_len, total_words):
    input_len = max_sequence_len - 1
    model = Sequential()
    
    # Add Input Embedding Layer
        # maps each integer word index to a dense, trainable 10-dimensional vector
    model.add(Embedding(total_words, 10, input_length=input_len))
    
    # Add Hidden Layer 1, a LSTM Layer w/100 nodes
    model.add(LSTM(100))
    model.add(Dropout(0.1)) # and a dropout function
    
    # Add Output Layer
    model.add(Dense(total_words, activation='softmax'))
    
    # optimiser, with gradient clipping by norm
    optimizer = keras.optimizers.Adam(learning_rate=0.001, clipnorm=1)

    # add it all together
    model.compile(loss='categorical_crossentropy', optimizer=optimizer)
    
    return model

model = create_model(max_sequence_len, total_words)
#model.summary()
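The `Embedding` layer in `create_model` is easier to picture as a lookup table: one row of weights per word index. The NumPy sketch below uses a tiny made-up vocabulary and random weights standing in for the learned ones, and shows that the lookup gives the same result as multiplying a one-hot matrix by the weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 3          # toy sizes; the notebook uses total_words and 10
weights = rng.normal(size=(vocab_size, embed_dim))  # stand-in for learned weights

token_ids = np.array([0, 0, 4, 2])    # one padded toy sequence
embedded = weights[token_ids]         # row lookup, shape (4, 3)

# same result the slow way: one-hot encode, then matrix-multiply
one_hot = np.eye(vocab_size)[token_ids]
via_matmul = one_hot @ weights

print(embedded.shape)
```

In the model above the table has `total_words` rows and 10 columns, so each padded sequence of 18 indices becomes an 18 x 10 array before it reaches the LSTM.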

In [18]:
model.fit(predictors, label, epochs=100, verbose=0)
# default batch size is 32 
# verbose = 0 is for silent
# shuffles by default
# loads of other parameters: https://keras.io/models/model/#fit

<keras.callbacks.callbacks.History at 0x1529de438>

#### Assess model accuracy

In [20]:
# here's our mean loss per sample, measured on the training data:  
# (using loss function defined in model)
model.evaluate(predictors, label, batch_size = 100)



1.3524951553820967
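One way to read this number: categorical cross-entropy is the average negative log-probability the model assigns to the correct next word, so exponentiating it gives a perplexity. The sketch below computes the loss by hand on toy probabilities (`mean_cross_entropy` is a hypothetical helper) and then converts the reported loss.

```python
import numpy as np

def mean_cross_entropy(probs, one_hot_labels, eps=1e-7):
    # average over samples of -log(probability assigned to the true class)
    probs = np.clip(probs, eps, 1.0)
    return float(-np.mean(np.sum(one_hot_labels * np.log(probs), axis=1)))

toy_probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1]])
toy_labels = np.array([[1, 0, 0],
                       [0, 1, 0]])
toy_loss = mean_cross_entropy(toy_probs, toy_labels)

# perplexity for the loss reported by model.evaluate above
reported_perplexity = float(np.exp(1.3524951553820967))
print(toy_loss, reported_perplexity)
```

A perplexity of about 3.87 means the model is, on average, about as uncertain as if it were choosing between roughly four equally likely words. Since this is measured on the same data the model was trained on, it says little about unseen headlines.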

In [37]:
# % of next words predicted correctly (on the training data)
preds = model.predict(predictors)
# print(np.argmax(preds[0]))
# print(np.argmax(label[0]))

correct = 0

for i,j in zip(preds, label):
    if np.argmax(i) == np.argmax(j):
        correct += 1

np.round(correct/len(preds) * 100, 2)

9
17


73.41
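The same percentage can be computed without the loop by comparing the two `argmax` arrays directly. Here is a sketch on toy arrays; `top1_accuracy` is a hypothetical helper name.

```python
import numpy as np

def top1_accuracy(pred_probs, one_hot_labels):
    # percentage of rows where the most probable class matches the true class
    hits = np.argmax(pred_probs, axis=1) == np.argmax(one_hot_labels, axis=1)
    return float(np.mean(hits) * 100)

toy_preds = np.array([[0.1, 0.9],
                      [0.8, 0.2],
                      [0.3, 0.7],
                      [0.6, 0.4]])
toy_labels = np.array([[0, 1],
                       [1, 0],
                       [1, 0],
                       [1, 0]])
acc = top1_accuracy(toy_preds, toy_labels)
print(acc)  # 75.0
```

As with the loss, this is accuracy on the training data, so it mostly measures how well the model has memorised these headlines. Holding out a share of the sequences before fitting would give a fairer number.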

#### Text Generation 

(the good stuff)

In [38]:
def generate_text(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
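        # argmax over the softmax output; predict_classes was removed from later Keras releases
        predicted = np.argmax(model.predict(token_list, verbose=0), axis=-1)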
        
        output_word = ""
        for word,index in tokenizer.word_index.items():
            if index == predicted:
                output_word = word
                break
        seed_text += " "+output_word
    return seed_text.title()
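The inner loop in `generate_text` scans the whole vocabulary for every generated word. A possible alternative, sketched here with a small made-up dictionary in place of `tokenizer.word_index`, is to invert the mapping once and then look words up by index.

```python
# hypothetical stand-in with the same shape as tokenizer.word_index
word_index = {"the": 1, "of": 2, "trump": 3}

# build the reverse mapping once, outside the generation loop
index_word = {index: word for word, index in word_index.items()}

def lookup(index):
    # index 0 is padding and has no word, so fall back to an empty string
    return index_word.get(int(index), "")

print(lookup(2))  # of
```

The Keras `Tokenizer` also keeps an `index_word` dictionary of its own, which could be used directly if it is available in the installed version.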

In [39]:
print(generate_text("idaho", 5, model, max_sequence_len))

Idaho Days Of The Bluff The


In [40]:
print(generate_text("idaho",12, model, max_sequence_len))

Idaho Days Of The Bluff The Limits Of Trumps Negotiation Strategy Apart Niger


In [41]:
print(generate_text("india", 5, model, max_sequence_len))

India A Song Of Face Creams


In [42]:
print(generate_text("wall street", 5, model, max_sequence_len))

Wall Street Bannon Carrier Moment From Statins


In [44]:
print(generate_text("climate change", 7, model, max_sequence_len))

Climate Change Became Ms On The Delta Used Of


In [48]:
print(generate_text("scientist", 7, model, max_sequence_len))

Scientist Days Of The Bluff The Limits Of


In [49]:
print(generate_text("apple", 7, model, max_sequence_len))

Apple Days Of The Bluff The Limits Of
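"Idaho", "Scientist" and "Apple" all lead to the same continuation. A likely explanation (not checked here) is that those seed words are not in the tokenizer's vocabulary, so `texts_to_sequences` drops them and the model starts from pure padding each time. On top of that, the generator always takes the single most probable word, so the same input always gives the same output. A common alternative to that greedy choice, which is not what this notebook does, is to sample from the predicted distribution with a temperature. A minimal NumPy sketch on a made-up distribution follows; `sample_with_temperature` is a hypothetical helper.

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0, rng=None):
    # temperature < 1 sharpens the distribution, > 1 flattens it
    if rng is None:
        rng = np.random.default_rng()
    logits = np.log(np.clip(probs, 1e-9, 1.0)) / temperature
    scaled = np.exp(logits - logits.max())   # subtract max for numerical stability
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

toy_probs = np.array([0.6, 0.3, 0.1])        # made-up next-word probabilities
rng = np.random.default_rng(42)

greedy = int(np.argmax(toy_probs))           # what the notebook's generator does
draws = [sample_with_temperature(toy_probs, 1.0, rng) for _ in range(1000)]
cold_draws = [sample_with_temperature(toy_probs, 0.01, rng) for _ in range(20)]
```

At a very low temperature the samples collapse back to the greedy choice, while at temperature 1.0 the less likely words get picked some of the time. In `generate_text` this would replace the `argmax` step with a draw from `model.predict(token_list)[0]`.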
