## Hard-coding data paths

This notebook should mostly just run. It does, however, require you to fill in the cell below.

`vocab_location` should be the name of a pickle file (including path), which will store information about the words in the corpus.

`model_location` should be the name of a .npz (zipped numpy) file (including path), which will store the model itself as a numpy array.

`sample_location` should be the path to a directory which contains your data. Each file should contain JSON-parsable lines. The directory can have subdirectories; the code will find the files recursively. There should be no files anywhere in the directory except those you want the code to parse.
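For reference, a `my_config.py` along the following lines would define the names this notebook relies on. This is only a sketch: every path is a placeholder rather than a location from the original project, and `data_root` is a hypothetical helper name. `file_extension`, `parsed_data_location`, and `training_data_location` are included because later cells reference them.

```python
# my_config.py: hypothetical example, every path here is a placeholder
import os

data_root = os.path.join("data", "my_corpus")  # assumed layout, adjust as needed

# Pickle file that will hold the vocabulary
vocab_location = os.path.join(data_root, "model", "vocab.pkl")
# .npz file that will hold the trained model
model_location = os.path.join(data_root, "model", "model.npz")
# Directory tree of input files, one JSON object per line
sample_location = os.path.join(data_root, "raw")
# Only files whose names end with this extension are parsed
file_extension = ".json"
# Directory that receives the tokenized CSV files
parsed_data_location = os.path.join(data_root, "parsed")
# Text file with one sentence per line, used as training input
training_data_location = os.path.join(data_root, "training_sentences.txt")
```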

In [1]:
from my_config import *
# The code will look for all files in sample_location which end with file_extension

In [16]:
path_to_word2vec = '/Users/chrisn/mad-science/pythia/data/stackexchange/model/word2vecAnime.bin'

In [36]:
# Import auxiliary modules
import os
import json
import numpy
import csv
import sys
import random

In [6]:
# Import skipthoughts modules
import theano
import theano.tensor as tensor
sys.path.append('/Users/chrisn/mad-science/pythia/src/featurizers/')
from training import vocab, train, tools
import skipthoughts

In [None]:
# Import pythia modules
sys.path.append('/Users/chrisn/mad-science/pythia/')
from src.utils import normalize, tokenize

In [14]:
from gensim.models import Word2Vec

In [3]:
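# For evaluation purposes, import some sklearn modules
from sklearn.linear_model import LinearRegression
# train_test_split lives in sklearn.model_selection; sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split
import pandas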

In [4]:
import warnings
warnings.filterwarnings('ignore')
# Because there were a lot of annoying warnings.
# The Beautiful Soup module as used in the pythia normalization is mad about something
# And the skip-thoughts code is full of deprecation warnings about how numpy works. The warnings can crash my system

In [None]:
theano.config.floatX = 'float32'

In [None]:
theano.config.floatX

In [None]:
theano.config.device

## Tokenization and normalization

Who knows the best way to do this? I tried to match the expectations of both the skip-thoughts code and the pythia codebase as best I could.

For each document:

1) Make a list of sentences. We use `utils.tokenize.punkt_sentences`.

2) Normalize each sentence: remove HTML and make everything lower-case. We use `utils.normalize.xml_normalize`.

3) Tokenize each sentence, so that each sentence becomes a string of space-separated tokens. We use `utils.tokenize.word_punct_tokens` and rejoin the tokens.
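The three steps can be sketched without the pythia utilities. The functions below are simplified, regex-based stand-ins that I am assuming behave roughly like `punkt_sentences`, `xml_normalize`, and `word_punct_tokens`; they are not those functions, and their names are made up for illustration.

```python
import html
import re

def split_sentences(text):
    # Stand-in for utils.tokenize.punkt_sentences: naive split after ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def normalize_sentence(sentence):
    # Stand-in for utils.normalize.xml_normalize: drop tags, unescape entities, lower-case
    no_tags = re.sub(r"<[^>]+>", " ", sentence)
    return html.unescape(no_tags).lower().strip()

def tokenize_sentence(sentence):
    # Stand-in for utils.tokenize.word_punct_tokens: runs of word characters or of punctuation
    return re.findall(r"\w+|[^\w\s]+", sentence)

def prepare_document(text):
    # Step 1: sentences, step 2: normalize, step 3: tokenize and rejoin with spaces
    sentences = split_sentences(text)
    normal = [normalize_sentence(s) for s in sentences]
    return [" ".join(tokenize_sentence(s)) for s in normal]

example = "<p>Is Naruto any good? I watched <b>two</b> episodes.</p>"
prepared = prepare_document(example)
print(prepared)
```

Each document comes out as a list of strings, one per sentence, which is the shape the cells below work with.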




In [None]:
# Instead of trying to parse in memory, can instead parse line by line and write to disk
fieldnames = ["body_text", "post_id","cluster_id", "order", "novelty"]
for root,dirs,files in os.walk(sample_location):
    for doc in files:
        if doc.endswith(file_extension):
            for line in open(os.path.join(root,doc)):  # root already starts with sample_location
                temp_dict = json.loads(line)
                post_id = temp_dict['post_id']
                text = temp_dict['body_text']
                sentences = tokenize.punkt_sentences(text)
                normal = [normalize.xml_normalize(sentence) for sentence in sentences]
                tokens = [' '.join(tokenize.word_punct_tokens(sentence)) for sentence in normal]
                base_doc = doc.split('.')[0]
                output_filename = "{}_{}.csv".format(base_doc,post_id)
                rel_path = os.path.relpath(root,sample_location)
                output_path = os.path.join(parsed_data_location,rel_path,output_filename)
                os.makedirs(os.path.dirname(output_path), exist_ok = True)
                with open(output_path,'w') as token_file:
                    #print(parsed_data_location,rel_path,output_filename)
                    writer = csv.DictWriter(token_file,fieldnames)
                    writer.writeheader()
                    output_dict = temp_dict
                    for token in tokens:
                        output_dict['body_text'] = token
                        writer.writerow(output_dict)

In [None]:
os.path.relpath(sample_location,root)

In [None]:
os.path.split(sample_location)

In [None]:
root

## An annoying issue

`tokenized` is now a list of lists. Each inner list represents a document as a list of strings, where each string represents a sentence.

The trainer expects a list of sentences. To match expectations, those inner brackets need to disappear.

However, this then looks like we have one real long document where the documents have been smashed together in arbitrary order. And the training will mistake the first sentence of one document as being part of the context of the last sentence of another. For sufficiently long documents, you can argue this is just noise. For documents that are themselves only a few sentences, this seems like too much noise.

My kludgy fix is to introduce a sentence consisting of a single null character `'\0'` and add this sentence after every document when concatenating. This may have unintended side-effects.
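On toy data, the idea looks like the sketch below. `flatten_with_separator` and `toy_docs` are names invented for this illustration, not part of the skip-thoughts or pythia code.

```python
doc_separator = '\0'  # the single-null-character sentence described above

def flatten_with_separator(documents, separator=doc_separator):
    """Flatten a list of documents (each a list of sentences) into one list,
    adding the separator sentence after every document."""
    flat = []
    for doc in documents:
        flat.extend(doc)
        flat.append(separator)
    return flat

toy_docs = [["first doc , sentence one .", "first doc , sentence two ."],
            ["second doc , only sentence ."]]
flat_sentences = flatten_with_separator(toy_docs)
print(len(flat_sentences))
```

The trainer then sees the separator as the neighbour of each document's first and last sentence, instead of a sentence from an unrelated document.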

In [None]:
doc_separator = '\0'

In [None]:
sentences = []
cluster_ids = []
with open(training_data_location,'w') as outfile:
    for root, dirs, files in os.walk(parsed_data_location):
        for doc in files:
            if doc.endswith('.csv'):
                for line in csv.DictReader(open(os.path.join(root,doc))):
                    outfile.write(line['body_text'] + '\n')
                    sentences.append(line['body_text'])
                    cluster_ids.append(int(line['cluster_id']))
                outfile.write(doc_separator + '\n')
                sentences.append(doc_separator)  # keep the in-memory list in step with the file and cluster_ids
                cluster_ids.append(-1)
cluster_ids = numpy.array(cluster_ids)

In [None]:
len(cluster_ids)

In [None]:
# Can skip this cell if sentences is still in memory
sentences = [x.strip() for x in open(training_data_location).readlines()]

In [None]:
len(sentences)

In [None]:
# wordcount: maps each word to its count, ordered by first appearance in the text
# worddict: maps each word to an integer id
worddict, wordcount = vocab.build_dictionary(sentences)

In [None]:
vocab.save_dictionary(worddict, wordcount, vocab_location)
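To make the two return values concrete, here is a rough sketch of what a dictionary builder of this kind does. It is not the skip-thoughts implementation. In particular, starting the ids at 2 rests on my assumption that ids 0 and 1 are reserved for end-of-sentence and unknown-word markers, and `build_dictionary_sketch` is a made-up name.

```python
from collections import OrderedDict

def build_dictionary_sketch(sentences):
    # Count whitespace-separated tokens, keeping first-appearance order
    wordcount = OrderedDict()
    for sentence in sentences:
        for word in sentence.split():
            wordcount[word] = wordcount.get(word, 0) + 1
    # Rank words by descending count; ties keep first-appearance order
    ranked = sorted(wordcount, key=lambda w: -wordcount[w])
    # Assumption: ids 0 and 1 are reserved, so real words start at 2
    worddict = OrderedDict((word, idx + 2) for idx, word in enumerate(ranked))
    return worddict, wordcount

toy_worddict, toy_wordcount = build_dictionary_sketch(["a b a", "c a b"])
print(dict(toy_worddict), dict(toy_wordcount))
```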

## Setting parameters

Definitely set:
* saveto: a path where the model will be periodically saved
* dictionary: where the dictionary is.

Consider tuning:
* dim_word: the dimensionality of the RNN word embeddings (Default 620)
* dim: the size of the hidden state (Default 2400)
* max_epochs: the total number of training epochs (Default 5)

* decay_c: weight decay hyperparameter (Default 0, i.e. ignored)
* grad_clip: gradient clipping hyperparameter (Default 5)
* n_words: the size of the decoder vocabulary (Default 20000)
* maxlen_w: the max number of words per sentence. Sentences longer than this will be ignored (Default 30)
* batch_size: size of each training minibatch (roughly) (Default 64)
* saveFreq: save the model after this many weight updates (Default 1000)

Other options:
* displayFreq: display progress after this many weight updates (Default 1)
* reload_: whether to reload a previously saved model (Default False)

## Some observations on parameters

The default displayFreq is 1, which means every iteration prints something. That seems excessive. I suggest 100.

As long as the computer can handle it in memory, a bigger batch size seems better all around. I am trying 256.

A good chunk of stackexchange sentences seemed to be at least 30 tokens long, so with the default maxlen_w of 30 they would be ignored. I am changing that setting to 40.
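Put together, these observations would translate into a parameter dictionary along these lines. The two file names are placeholders standing in for `model_location` and `vocab_location`, and `suggested_params` is a name made up for this sketch; the keyword names come from the parameter list above.

```python
suggested_params = dict(
    saveto="model.npz",      # placeholder for model_location
    dictionary="vocab.pkl",  # placeholder for vocab_location
    displayFreq=100,         # report progress every 100 updates instead of every update
    batch_size=256,          # larger minibatch, memory permitting
    maxlen_w=40,             # keep sentences of up to 40 tokens
)
print(sorted(suggested_params))
```

Anything left out keeps its default from the list above.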

In [None]:
# Using a small set of parameters for testing
params = dict(
    saveto = model_location,
    dictionary = vocab_location,
    n_words = 1000,
    dim_word = 100,
    dim = 500,
    max_epochs = 1,
    saveFreq = 100,
    )

In [None]:
train.trainer(sentences,**params)

The model created doesn't quite fit into the pipeline, because it is a "uni-skip" model, not a "combine skip" model. The pipeline uses skipthoughts.encode, which requires very particularly formatted models.

In [7]:
model = tools.load_model([])

Loading dictionary...
Creating inverted dictionary...
Loading model options...
Loading model parameters...
Compiling encoder...
Creating word lookup tables...


AttributeError: 'list' object has no attribute 'vocab'

In [None]:
# Having a lot of trouble getting this line to not crash.
tools.encode(model,sentences)

## How to evaluate?

Treat this as a supervised task: apply cluster_id as the label for each sentence, run a regression, and evaluate performance.

Choose a percentage of the data to be the training data.
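The subsample-then-split logic can be sketched with the standard library alone. This is an illustration of the idea, not a replacement for sklearn's `train_test_split`; `sample_and_split` and its argument names are hypothetical.

```python
import random

def sample_and_split(n_items, evaluation_fraction, test_fraction, seed=0):
    """Pick a random subsample of indices, then divide it into train and test."""
    rng = random.Random(seed)
    n_sample = int(evaluation_fraction * n_items)
    sample = rng.sample(range(n_items), n_sample)  # sampling without replacement
    n_test = int(round(test_fraction * n_sample))
    test_idx = sorted(sample[:n_test])
    train_idx = sorted(sample[n_test:])
    return train_idx, test_idx

train_idx, test_idx = sample_and_split(n_items=10, evaluation_fraction=1, test_fraction=0.5)
print(len(train_idx), len(test_idx))
```

The two index lists can then be used to select matching rows from the encodings and from `cluster_ids`.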

In [24]:
evaluation_percent = 1 # Fraction of the data to subsample (1 means use everything)
holdout_percent = 0.5 # Fraction of that subsample held out as testing data (passed as test_size); the rest is training data

In [25]:
# Read in the sentences
sentences = [x.strip() for x in open(training_data_location).readlines()]

In [26]:
num_sentences = len(sentences)

In [27]:
# Read in cluster ids
cluster_ids = []
for root, dirs, files in os.walk(parsed_data_location):
    for doc in files:
        if doc.endswith('.csv'):
            for line in csv.DictReader(open(os.path.join(root,doc))):
                cluster_ids.append(int(line['cluster_id']))
            cluster_ids.append(-1)
cluster_ids = numpy.array(cluster_ids)

In [28]:
# Sanity check. Should be true.
num_sentences == len(cluster_ids)

True

In [52]:
indices = numpy.arange(num_sentences)
num_samples = int(evaluation_percent * num_sentences)
index_sample = numpy.sort(numpy.random.choice(indices, size=num_samples, replace = False))

In [63]:
sample_sentences = [sentences[i] for i in index_sample]
sample_clusters = cluster_ids[index_sample]

4790

In [38]:
# This cell can easily kill my notebook's memory
# Instead I recommend the command-line scri
model = tools.load_model()

Loading dictionary...
Creating inverted dictionary...
Loading model options...
Loading model parameters...
Compiling encoder...
Loading word2vec embeddings...


KeyboardInterrupt: 

In [55]:
# Broken!!!
#encodings = tools.encode(model, sample_sentences)
encodings = numpy.random.rand(num_samples,10) #Since I can't get encodings to actually work. Here are some numbers.

In [70]:
encoding_train.shape

(4790, 10)

In [69]:
encodings.shape

(9580, 10)

In [71]:
sample_clusters.shape

(9580,)

In [75]:
encoding_train, encoding_test, cluster_train, cluster_test = train_test_split(encodings,
                                                                              sample_clusters,
                                                                              test_size = holdout_percent)

In [74]:
encoding_test.shape

(4790,)

In [78]:
regression = LinearRegression()
regression.fit(encoding_train, cluster_train)
regression.predict(encoding_test)
regression.score(encoding_test, cluster_test)

-0.002452444266647591

## An in-memory approach.

Because everything kept crashing on me, I ultimately found it most convenient to do everything in a streaming fashion with a lot of writing to disk at every stage. This is obviously slower than desirable. Basically I do a thing, write out the results, read the results back in, then do the next thing.
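A minimal sketch of that streaming pattern, using only the standard library, is shown here. The generator name `iter_json_lines`, the throwaway directory, and the file names are all assumptions made for the demonstration.

```python
import json
import os
import tempfile

def iter_json_lines(directory, extension):
    """Yield one parsed JSON object per line from every matching file under directory."""
    for root, dirs, files in os.walk(directory):
        for name in sorted(files):
            if name.endswith(extension):
                with open(os.path.join(root, name)) as handle:
                    for line in handle:
                        if line.strip():
                            yield json.loads(line)

# Tiny demonstration on a throwaway directory
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "posts.json"), "w") as handle:
    handle.write(json.dumps({"post_id": 1, "body_text": "hello"}) + "\n")
    handle.write(json.dumps({"post_id": 2, "body_text": "world"}) + "\n")

# Stage one: stream records and write the result straight to disk
stage_one_output = os.path.join(demo_dir, "bodies.txt")
with open(stage_one_output, "w") as out:
    for record in iter_json_lines(demo_dir, ".json"):
        out.write(record["body_text"] + "\n")

# Stage two: read the stage-one result back in
with open(stage_one_output) as handle:
    bodies = [line.strip() for line in handle]
print(bodies)
```

Only one line is held in memory at a time during stage one, at the cost of the extra write and read.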

Below is an in-memory approach that reads everything into memory and pushes forward, still sometimes saving key steps to disk, but without any rereading in. Because of various technical issues, this code has never been tested at scale. It works on the anime dataset.

In [None]:
# This cell may be too memory inefficient
doc_dicts = [json.loads(line)
            for root,dirs,files in os.walk(stackoverflow_sample_location)
            for doc in files
            for line in open(os.path.join(root,doc))  # root already starts with stackoverflow_sample_location
            ]
# doc_dicts is a list of dictionaries, each containing document data
# In the anime sample, the text is labeled 'body_text'
# There is a field cluster_id which we will use as the categorical label
cluster_ids = [d['cluster_id'] for d in doc_dicts]
docs = [d['body_text'] for d in doc_dicts]

In [None]:
del(doc_dicts) # For efficiency

In [None]:
# Make list of sentences for each doc
sentenced = [tokenize.punkt_sentences(doc) for doc in docs]

In [None]:
# Normalize each sentence
normalized = [[normalize.xml_normalize(sentence) for sentence in doc] for doc in sentenced]

In [None]:
del(sentenced) #If you're done with it

In [None]:
#Tokenize each sentence
tokenized = [[' '.join(tokenize.word_punct_tokens(sentence)) for sentence in doc] for doc in normalized]

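# Save the tokenized documents; the with-block makes sure the file is flushed and closed
with open('/Users/chrisn/mad-science/pythia/data/book_corpus/model/tokenized.json','w') as tokenized_file:
    json.dump(tokenized, tokenized_file)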

In [None]:
# Add the separator sentence after each document and flatten into one list of sentences
sentences = [sentence for doc in tokenized for sentence in doc + [doc_separator]]

In [None]:
del(tokenized)

In [11]:
6164553/64

96321.140625

In [8]:
from gensim.models import Word2Vec

In [10]:
Word2Vec.load_word2vec_format()

RuntimeError: invalid file URI: 

In [17]:
embed_map = Word2Vec.load_word2vec_format(path_to_word2vec, binary=True)

In [20]:
embed_map.vector_size

100

In [21]:
embed = tools.load_googlenews_vectors('/Users/chrisn/mad-science/pythia/data/google_news/GoogleNews-vectors-negative300.bin.gz')

In [22]:
embed.vector_size

300