## Hard-coding data paths

This notebook should mostly just run. However, it requires the user to fill in the cell below.

`vocab_location` should be the name of a pickle file (including path), which will store information about the words in the corpus.

`model_location` should be the name of a .npz (zipped numpy) file (including path), which will store the model itself as a numpy array.

`sample_location` should be the path to a directory that contains your data. Each file should contain JSON-parsable lines. The directory can have subdirectories; the code finds the files recursively. There should be no files anywhere in the directory except those the code is meant to parse.
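For illustration, a `my_config.py` might look like the sketch below. The variable names are the ones this notebook uses; every path and the extension value are placeholders made up for the example, not the actual settings.

```python
# my_config.py: hypothetical example, all values below are placeholders
import os

data_root = os.path.join("data", "my_corpus")  # hypothetical base directory

# Pickle file that will hold the dictionary
vocab_location = os.path.join(data_root, "model", "vocab.pkl")
# .npz file that will hold the trained model
model_location = os.path.join(data_root, "model", "model.npz")
# Directory (possibly nested) of files with one JSON document per line
sample_location = os.path.join(data_root, "raw")
# Only files ending with this extension are parsed
file_extension = ".json"
# Directory where the per-post CSV files are written
parsed_data_location = os.path.join(data_root, "parsed")
# Text file with one sentence per line, used for training
training_data_location = os.path.join(data_root, "training.txt")
```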

In [2]:
from my_config import *
# The code will look for all files in sample_location which end with file_extension

In [None]:
# Import auxiliary modules
import os
import json
import numpy
import csv
import sys

In [None]:
# Import skipthoughts modules
import theano
import theano.tensor as tensor
sys.path.append('/Users/chrisn/mad-science/pythia/src/featurizers/')
from training import vocab, train, tools
import skipthoughts

In [4]:
# Import hyperopt modules
from hyperopt import hp, fmin, tpe

In [None]:
# Import pythia modules
sys.path.append('/Users/chrisn/mad-science/pythia/')
from src.utils import normalize, tokenize

In [None]:
# For evaluation purposes, import some sklearn modules
from sklearn.linear_model import LinearRegression

In [None]:
import warnings
warnings.filterwarnings('ignore')
# Because there were a lot of annoying warnings. Ignore this cell if you want to see them.

In [None]:
theano.config.floatX = 'float32'

In [None]:
theano.config.device

In [None]:
# This cell may be too memory-inefficient: it loads every document at once
doc_dicts = [json.loads(line)
            for root,dirs,files in os.walk(stackoverflow_sample_location)
            for doc in files
            # root already includes the top-level directory, so join only root and doc
            for line in open(os.path.join(root,doc))
            ]
# doc_dicts is a list of dictionaries, each containing document data
# In the anime sample, the text is labeled 'body_text'
# There is a field cluster_id which we will use as the categorical label
cluster_ids = [d['cluster_id'] for d in doc_dicts]
docs = [d['body_text'] for d in doc_dicts]
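If memory is a concern, the same traversal can be written as a generator that yields one document at a time. This is a minimal sketch, not part of the pythia codebase; the function name `iter_json_lines` and its `extension` argument are made up for the example.

```python
import json
import os

def iter_json_lines(directory, extension=""):
    """Yield one parsed dict per line from every matching file under directory."""
    for root, dirs, files in os.walk(directory):
        for name in sorted(files):
            if not name.endswith(extension):
                continue
            # root already contains the path down from directory
            with open(os.path.join(root, name)) as handle:
                for line in handle:
                    if line.strip():  # skip blank lines
                        yield json.loads(line)
```

Usage would look like `docs = (d['body_text'] for d in iter_json_lines(sample_location, file_extension))`, which keeps only one parsed line in memory at a time and closes each file as soon as it has been read.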

## Tokenization and normalization

There is no obviously best way to do this. I tried to match the expectations of both the skip-thoughts code and the pythia codebase as best I could.

For each document:

1) Make a list of sentences. We use utils.tokenize.punkt_sentences.

2) Normalize each sentence: remove HTML and make everything lowercase. We use utils.normalize.xml_normalize.

3) Tokenize each sentence, so that each sentence becomes a string of space-separated tokens. We use utils.tokenize.word_punct_tokens and rejoin the tokens.
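To show the shape of the data at each of the three steps, here is a rough stand-in that uses only the standard library. It does not call the pythia functions and its behavior will differ from theirs (the sentence splitter in particular is naive); all three helper names are made up for the example.

```python
import re

def split_sentences(text):
    """Naive stand-in for a Punkt sentence splitter: split after ., ! or ?."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def normalize_sentence(sentence):
    """Replace anything that looks like a tag with a space and lower-case the rest."""
    return re.sub(r'<[^>]+>', ' ', sentence).lower().strip()

def tokenize_sentence(sentence):
    """Split into runs of word characters or runs of punctuation, rejoined with spaces."""
    return ' '.join(re.findall(r'\w+|[^\w\s]+', sentence))

def preprocess(text):
    """Document string -> list of sentences, each a string of space-separated tokens."""
    return [tokenize_sentence(normalize_sentence(s)) for s in split_sentences(text)]

result = preprocess("Hello, <b>World</b>! How are you?")
# result == ['hello , world !', 'how are you ?']
```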




In [None]:
# Make list of sentences for each doc
sentenced = [tokenize.punkt_sentences(doc) for doc in docs]
# Normalize each sentence
normalized = [[normalize.xml_normalize(sentence) for sentence in doc] for doc in sentenced]
# Tokenize each sentence and rejoin the tokens with spaces
tokenized = [[' '.join(tokenize.word_punct_tokens(sentence)) for sentence in doc] for doc in normalized]

In [None]:
with open('/Users/chrisn/mad-science/pythia/data/book_corpus/model/tokenized.json','w') as tokenized_file:
    json.dump(tokenized,tokenized_file)

## Alternative route

In [None]:
# Instead of trying to parse in memory, can instead parse line by line and write to disk
fieldnames = ["body_text", "post_id","cluster_id", "order", "novelty"]
for root,dirs,files in os.walk(sample_location):
    for doc in files:
        if doc.endswith(file_extension):
            for line in open(os.path.join(root,doc)):  # root already includes sample_location
                temp_dict = json.loads(line)
                post_id = temp_dict['post_id']
                text = temp_dict['body_text']
                sentences = tokenize.punkt_sentences(text)
                normal = [normalize.xml_normalize(sentence) for sentence in sentences]
                tokens = [' '.join(tokenize.word_punct_tokens(sentence)) for sentence in normal]
                base_doc = doc.split('.')[0]
                output_filename = "{}_{}.csv".format(base_doc,post_id)
                rel_path = os.path.relpath(root,sample_location)
                output_path = os.path.join(parsed_data_location,rel_path,output_filename)
                os.makedirs(os.path.dirname(output_path), exist_ok = True)
                with open(output_path,'w') as token_file:
                    #print(parsed_data_location,rel_path,output_filename)
                    writer = csv.DictWriter(token_file,fieldnames)
                    writer.writeheader()
                    output_dict = temp_dict
                    for token in tokens:
                        output_dict['body_text'] = token
                        writer.writerow(output_dict)
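One thing to watch in the cell above: `csv.DictWriter` raises a `ValueError` when a row contains keys that are not in `fieldnames`, so any extra field in the parsed JSON would stop the loop. A minimal sketch of one way around that is the writer's `extrasaction='ignore'` option. The record below is made up for illustration.

```python
import csv
import io

fieldnames = ["body_text", "post_id", "cluster_id", "order", "novelty"]
# Hypothetical record with one field ("extra") that is not in fieldnames
record = {"body_text": "hello world", "post_id": 7, "cluster_id": 3,
          "order": 0, "novelty": True, "extra": "dropped"}

buffer = io.StringIO()  # stands in for the output file
# extrasaction="ignore" silently drops keys that are not in fieldnames
writer = csv.DictWriter(buffer, fieldnames, extrasaction="ignore")
writer.writeheader()
writer.writerow(record)
csv_text = buffer.getvalue()
```

Whether silently dropping fields is desirable depends on the data; the default behavior of raising an error is the safer choice when unexpected fields would indicate a problem upstream.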

In [None]:
os.path.relpath(sample_location,root)

In [None]:
os.path.split(sample_location)

In [None]:
root

## Category labels

Each document had a cluster_id. We use this cluster_id as the categorical label for each sentence.

We create a numpy array of shape num_sentences by (num_clusters + 1). The extra cluster is for the null sentences we will mention shortly.
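A minimal sketch of how such an array could be built, assuming null sentences are marked with the label -1 (as in the file-reading cell later in this notebook). The function name `one_hot_targets` is made up for the example.

```python
import numpy

def one_hot_targets(cluster_ids, null_label=-1):
    """Return (targets, clusters): a one-hot array with a final column for null sentences."""
    clusters = sorted(set(c for c in cluster_ids if c != null_label))
    column = {c: i for i, c in enumerate(clusters)}
    targets = numpy.zeros((len(cluster_ids), len(clusters) + 1), dtype='float32')
    for row, c in enumerate(cluster_ids):
        # Null sentences are not in the column map, so they land in the extra last column
        targets[row, column.get(c, len(clusters))] = 1.0
    return targets, clusters

targets, clusters = one_hot_targets([5, 5, -1, 9, -1])
# targets has shape (5, 3): columns for cluster 5, cluster 9, and the null sentence
```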

This section is currently incomplete. Please disregard it.

In [None]:
target = []
for doc_index, cluster_id in enumerate(cluster_ids):
    num_sentences = len(tokenized[doc_index]) # The number of sentences in the current document
    for i in range(num_sentences + 1): # the extra iteration presumably covers the null sentence
        target.append([])

In [None]:
target = numpy.array(target)

In [None]:
import glob
glob.glob(os.path.join(sample_location,"*.json"))

In [None]:
glob.glob(sample_location+"*.json")

In [None]:
sample_location+"*.json"

## An annoying issue

`tokenized` is now a list of lists. Each inner list represents a document as a list of strings, where each string represents a sentence.

The trainer expects a list of sentences. To match expectations, those inner brackets need to disappear.

However, this then looks like we have one real long document where the documents have been smashed together in arbitrary order. And the training will mistake the first sentence of one document as being part of the context of the last sentence of another. For sufficiently long documents, you can argue this is just noise. For documents that are themselves only a few sentences, this seems like too much noise.

My kludgy fix is to introduce a sentence consisting of a single null character `'\0'` and add this sentence between every document when concatenating. This may have unintended side effects.

In [6]:
doc_separator = '\0'

In [9]:
(doc_separator+"\n").strip()

'\x00'

In [10]:
"\n".strip()

''

In [None]:
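# If tokenized has been written to a filesystem and needs to be read in:
# write one sentence per line to training_data_location and collect sentences and labels
sentences = []
cluster_ids = []
with open(training_data_location,'w') as outfile:
    for root, dirs, files in os.walk(parsed_data_location):
        for doc in files:
            if doc.endswith('.csv'):
                with open(os.path.join(root,doc),newline='') as parsed_file:
                    for line in csv.DictReader(parsed_file):
                        outfile.write(line['body_text'] + '\n')
                        sentences.append(line['body_text'])
                        cluster_ids.append(int(line['cluster_id']))
                # Mark the document boundary; -1 is the label of the null sentence
                outfile.write(doc_separator + '\n')
                sentences.append(doc_separator) # keeps sentences aligned with cluster_ids
                cluster_ids.append(-1)
cluster_ids = numpy.array(cluster_ids)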

In [None]:
cluster_ids

In [None]:
# Combine all documents into one list of sentences, with the separator sentence after each document
# This cell works if tokenized is in memory

separated = sum(zip(tokenized,[[doc_separator]]*len(tokenized)),tuple())
sentences = sum(separated,[])
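The `sum`-based concatenation above builds a new list at every step, so it can get slow for a large number of documents. A minimal sketch of an alternative that produces the same list with a plain loop is shown here; the function name is made up for the example.

```python
def flatten_with_separator(documents, separator='\0'):
    """Concatenate documents (lists of sentences), adding separator after each one."""
    sentences = []
    for document in documents:
        sentences.extend(document)   # all sentences of this document, in order
        sentences.append(separator)  # null sentence marking the document boundary
    return sentences

flat = flatten_with_separator([["a b", "c d"], ["e f"]])
# flat == ['a b', 'c d', '\x00', 'e f', '\x00']
```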

In [None]:

for root, dirs, files in os.walk(parsed_data_location):
    pass # unfinished cell

In [5]:
# wordcount: the count of each word, ordered by first appearance in the text
# worddict: maps each word to an integer index
worddict, wordcount = vocab.build_dictionary(sentences)

NameError: name 'vocab' is not defined

In [None]:
vocab.save_dictionary(worddict, wordcount, vocab_location)
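To make the two return values concrete, here is a stand-in written with the standard library. It is not the skip-thoughts implementation; in particular, starting the indices at 2 (leaving 0 and 1 free for end-of-sentence and unknown-word markers) reflects my understanding of that code and should be treated as an assumption.

```python
from collections import OrderedDict

def build_dictionary_sketch(sentences):
    """Return (worddict, wordcount) for whitespace-tokenized sentences."""
    wordcount = OrderedDict()  # word -> count, in order of first appearance
    for sentence in sentences:
        for word in sentence.split():
            wordcount[word] = wordcount.get(word, 0) + 1
    # Most frequent words get the smallest indices; ties keep first-appearance order
    by_frequency = sorted(wordcount, key=lambda w: -wordcount[w])
    worddict = OrderedDict((word, index + 2) for index, word in enumerate(by_frequency))
    return worddict, wordcount

worddict_demo, wordcount_demo = build_dictionary_sketch(["x y y", "z y", "\0"])
```

Note that `str.split()` does not treat `'\0'` as whitespace, so in this sketch the document separator ends up in the vocabulary as an ordinary one-character word.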

## Setting parameters

Definitely set:
* saveto: a path where the model will be periodically saved
* dictionary: where the dictionary is.

Consider tuning:
* dim_word: the dimensionality of the RNN word embeddings
* dim: the size of the hidden state
* max_epochs: the total number of training epochs

* decay_c: weight decay hyperparameter
* grad_clip: gradient clipping hyperparameter
* n_words: the size of the decoder vocabulary
* maxlen_w: the max number of words per sentence. Sentences longer than this will be ignored
* batch_size: size of each training minibatch (roughly)
* saveFreq: save the model after this many weight updates

Other options:
* displayFreq: display progress after this many weight updates
* reload_: whether to reload a previously saved model

In [None]:
params = dict(
    saveto = model_location,
    dictionary = vocab_location,
    n_words = 1000,
    dim_word = 100,
    dim = 500,
    max_epochs = 1,
    saveFreq = 100,
    )

In [None]:
train.trainer(sentences,**params)

In [None]:
model = tools.load_model()

In [None]:
tools.encode(model,sentences)

In [None]:
skipthoughts.encode(model,sentences)

## How to evaluate?

Treat this as a supervised task: apply each document's cluster_id as the label for each of its sentences, run a regression on the sentence encodings, and evaluate its performance.

In [None]:
regression = LinearRegression()

In [None]:
regression.fit() # incomplete: fit needs a feature matrix and labels, e.g. sentence encodings and cluster_ids
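As a rough sketch of what the evaluation could look like, the example below fits an ordinary least squares model on one-hot cluster labels and predicts the cluster with the largest fitted value. It uses `numpy.linalg.lstsq` rather than sklearn's `LinearRegression` so that it is self-contained. The features are made-up 2-d vectors standing in for sentence encodings, and the function names are hypothetical.

```python
import numpy

def fit_least_squares(features, labels, num_classes):
    """Fit one linear model per class on one-hot targets; return the weight matrix."""
    # Append a column of ones so the model has an intercept
    design = numpy.hstack([features, numpy.ones((len(features), 1))])
    targets = numpy.eye(num_classes)[labels]  # one-hot encoding of the labels
    weights, _, _, _ = numpy.linalg.lstsq(design, targets, rcond=None)
    return weights

def predict(weights, features):
    """Predict the class whose linear model gives the largest value."""
    design = numpy.hstack([features, numpy.ones((len(features), 1))])
    return numpy.argmax(design @ weights, axis=1)

# Made-up "encodings" for four sentences from two clusters
features = numpy.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = numpy.array([0, 0, 1, 1])
weights = fit_least_squares(features, labels, num_classes=2)
accuracy = float(numpy.mean(predict(weights, features) == labels))
```

This measures accuracy on the training data only. A meaningful evaluation would hold out some documents, and would probably exclude the null separator sentences from scoring.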