# Train your own skip-thoughts model

This notebook walks you through training a skip-thoughts model. It was used with semi-success to train a skip-thoughts model on a gpu-backed machine with the Pythia kernel against the entire stackexchange corpus. It was prepared over the course of 8 days, and isn't perfect.

The first section will require the user to make several data paths.

The second section requires importing several modules, including pythia and skip-thoughts-specific modules, which may depend on python paths being set correctly. It may need some tweaking.

Since the Jupyter notebook on the gpu-backed machines was unreliable, I moved some sections of this notebook into small dirty python files in a folder called skip-thoughts_training.

There is some trickiness about what a skip-thoughts model is. The model provided by the skip-thoughts paper was actually trained 3 times: once on its corpus at full size ("uni-skip"), again on its corpus with half as many dimensions, and a third time with half as many dimensions on reversed-order sentences. The two half-size models together make up "bi-skip". They then concatenate the results.

We don't attempt this. We merely train a single "uni-skip" model.
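
To make the shapes concrete, here is a minimal sketch of that concatenation using random placeholder vectors in place of real encoder output. The sizes 2400 and 1200 assume the default `dim` of 2400 listed later in this notebook and the "half as many dimensions" description above. The variable names are illustrative, and nothing here calls the skip-thoughts code.

```python
import numpy

rng = numpy.random.default_rng(0)
n_sentences = 4

# Placeholder encodings, one row per sentence (random numbers, not model output)
uni_skip = rng.random((n_sentences, 2400))     # full-size model
bi_forward = rng.random((n_sentences, 1200))   # half-size model, normal order
bi_backward = rng.random((n_sentences, 1200))  # half-size model, reversed order

# The "bi-skip" vector joins the two half-size encodings
bi_skip = numpy.concatenate([bi_forward, bi_backward], axis=1)

# The combined vector joins uni-skip and bi-skip
combine_skip = numpy.concatenate([uni_skip, bi_skip], axis=1)

print(uni_skip.shape, bi_skip.shape, combine_skip.shape)
```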

The encode function found in the file skipthoughts.py insists on two models, uni-skip and bi-skip. To encode with a single uni-skip model, the encode function found in the tools.py file is needed. It unfortunately tends to die and die badly when I use it.

The last part of the code is supposed to be validation. Since my encoding is broken, validation is not tested.

Because things kept dying on me, I found it convenient to write to disk near constantly. This slows things down, but makes them more robust against dying computers.

The cells thus tend to use little memory and instead read and write in a streaming fashion, using a lot of time instead.



## Hard-coding data paths

Mostly this notebook should just run. However, it requires the user to deal with one of the next 2 cells.

#### Location of the data

`sample_location` should be the path to a directory which contains your data. Each file should contain json-parsable lines. The directory can have subdirectories. The code will recursively find the files. There should be no `.json` files anywhere in the directory except those the code wishes to parse.

`path_to_word2vec` is a `.bin` word2vec file the code depends on, e.g. the Google News model found at https://code.google.com/archive/p/word2vec/

#### Where to put output

`parsed_data_location` is a directory of `.csv` files the code will create, structured the same as `sample_location`, but where the sentences have been normalized and tokenized, and where each file represents a post.

`training_data_location` is the name of a file that will store the sentences in a single file, one per line, with null characters separating posts.

`vocab_location` should be the name of a pickle file (including path), which will store information about the words in the corpus

`model_location` should be the name of a .npz (zipped numpy) file (including path), which will store the model itself as a numpy array. The code will also create a .npz.pkl file with the same name containing some metadata.
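
For readers who have not met this file pairing before, here is a minimal sketch of writing and reading a `.npz` file with a pickle of metadata next to it. The array name `Wemb`, the option keys, and the temporary directory are made up for illustration. The real files are written by the skip-thoughts trainer and will contain different names.

```python
import os
import pickle
import tempfile

import numpy

# Hypothetical stand-ins for the real model parameters and options
fake_params = {'Wemb': numpy.zeros((5, 3), dtype='float32')}
fake_options = {'dim_word': 3, 'n_words': 5}

demo_dir = tempfile.mkdtemp()
demo_model_location = os.path.join(demo_dir, 'corpus.npz')

# Arrays go in the .npz file; metadata goes in a pickle alongside it
numpy.savez(demo_model_location, **fake_params)
with open(demo_model_location + '.pkl', 'wb') as f:
    pickle.dump(fake_options, f)

# Reading both back
with numpy.load(demo_model_location) as archive:
    loaded_params = {name: archive[name] for name in archive.files}
with open(demo_model_location + '.pkl', 'rb') as f:
    loaded_options = pickle.load(f)

print(sorted(loaded_params), loaded_options)
```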





In [None]:
sample_location = '/Users/chrisn/mad-science/pythia/data/stackexchange'
path_to_word2vec = '/Users/chrisn/mad-science/pythia/data/stackexchange/model/word2vecAnime.bin'
parsed_data_location = '/Users/chrisn/testing'
training_data_location = '/Users/chrisn/testing/training.txt'
vocab_location = '/Users/chrisn/mad-science/pythia/data/stackexchange/model/vocab.pickle'
model_location = '/Users/chrisn/mad-science/pythia/data/stackexchange/model/corpus.npz'

It is much easier to take the contents of the above cell, stick it in a file in your python path called my_config.py, and run the cell below.

In [1]:
from my_config import *
# The code will look for all files in sample_location which end with file_extension

I agree there are better ways to do that, either taking advantage of environment variables or parsing the config file using the `configparser` module.

I only had 2 weeks.
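
For what it's worth, a minimal sketch of the `configparser` route might look like the following. The section name `paths`, the file name `my_config.ini`, and the placeholder paths are all made up for illustration. The notebook itself still uses `my_config.py`.

```python
import configparser

# Example contents of a hypothetical config file, e.g. my_config.ini
example_ini = """
[paths]
sample_location = /path/to/data
path_to_word2vec = /path/to/word2vec.bin
parsed_data_location = /path/to/parsed
training_data_location = /path/to/parsed/training.txt
vocab_location = /path/to/model/vocab.pickle
model_location = /path/to/model/corpus.npz
"""

parser = configparser.ConfigParser()
parser.read_string(example_ini)  # use parser.read('my_config.ini') for a real file
paths = dict(parser['paths'])

for name, value in paths.items():
    print(name, '=', value)
```

Each notebook variable could then be assigned from `paths`, for example `vocab_location = paths['vocab_location']`.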

## Let's import some modules!

In [2]:
# Import auxiliary modules
import os
import json
import numpy
import csv
import sys
import random

In [3]:
# Import theano
import theano
import theano.tensor as tensor

In [4]:
# May need to set the flag if your .theanorc isn't correct. If you want to run on gpu, you should fix your .theanorc
# And make this cell irrelevant
theano.config.floatX = 'float32'

In [5]:
# Double check that floatX is float32
# device should be either cpu or gpu, as desired.
print(theano.config.floatX)
print(theano.config.device)

float32
cpu


So this next cell is my bad. The notebook only runs if your paths are all configured right. You may need to adjust the below cell to import pythia/skipthoughts modules.

In [9]:
# Import skipthoughts modules
#sys.path.append('/Users/chrisn/mad-science/pythia/src/featurizers/skipthoughts')
from training import vocab, train, tools
import skipthoughts
# Import pythia modules
#sys.path.append('/Users/chrisn/mad-science/pythia/')
from src.utils import normalize, tokenize

In [11]:
# For evaluation purposes, import some sklearn modules
from sklearn.linear_model import LinearRegression
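# train_test_split lives in model_selection; sklearn.cross_validation was removed in newer scikit-learn
from sklearn.model_selection import train_test_split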
import pandas

In [12]:
import warnings
warnings.filterwarnings('ignore')
# Because there were a lot of annoying warnings.
# The Beautiful Soup module as used in the pythia normalization is mad about something
# And the skip-thoughts code is full of deprecation warnings about how numpy works. The warnings can crash my system

## Tokenization and normalization

Who knows the best way to do this? I tried to match the expectations of both the skip-thoughts code and the pythia codebase as best I could.

For each document:

1) Make list of sentences. We use utils.tokenize.punkt_sentences

2) Normalize each sentence. Remove html and make everything lower-case. We use utils.normalize.xml_normalize

3) Tokenize each sentence. Now each sentence is a string of space-separated tokens. We use utils.tokenize.word_punct_tokens and rejoin the tokens.
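
The three steps above can be sketched without the pythia code as follows. The three `naive_*` functions are crude regular-expression stand-ins that I made up for `punkt_sentences`, `xml_normalize` and `word_punct_tokens`. They are not the pythia implementations and will behave differently on real text.

```python
import re

def naive_sentences(text):
    # Stand-in for punkt_sentences: split after ., ! or ? followed by whitespace
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

def naive_normalize(sentence):
    # Stand-in for xml_normalize: drop anything that looks like a tag, lower-case
    return re.sub(r'<[^>]+>', ' ', sentence).lower().strip()

def naive_tokens(sentence):
    # Stand-in for word_punct_tokens: runs of word characters or of punctuation
    return re.findall(r'\w+|[^\w\s]+', sentence)

def process(text):
    # Same order as the steps above: split, then normalize, then tokenize
    sentences = naive_sentences(text)
    normal = [naive_normalize(s) for s in sentences]
    return [' '.join(naive_tokens(s)) for s in normal]

print(process('<p>Hello World. How are you?</p>'))
# ['hello world .', 'how are you ?']
```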

Because I had so many difficulties with things crashing, I was happy whenever I got anything done and wanted to save where I was. I also became gun-shy about using memory. The solution below is thus entirely streaming, which slows it down because of file i/o.

The output of this section run on the entire stackexchange corpus can be found in /data/fs4/datasets/stackexchange_models/se_posts_parsed.tar.gz.

(Well, the tarring was done in the shell. This cell just creates a directory.)

The cell requires previously set variables: `sample_location` for the input and `parsed_data_location` for the output.

A parallel version of this section can be found in `stackoverflow_normalize.py`. It unfortunately contains some hardcoded paths.

In [13]:
file_extension = ".json"

In [14]:
# Instead of trying to parse in memory, can instead parse line by line and write to disk
fieldnames = ["body_text", "post_id","cluster_id", "order", "novelty"]
for root,dirs,files in os.walk(sample_location):
    for doc in files: 
        if doc.endswith(file_extension): #Recursively find all .json files
            # root already starts with sample_location, so joining it again breaks relative paths
            for line in open(os.path.join(root,doc)):
                temp_dict = json.loads(line)
                post_id = temp_dict['post_id']
                text = temp_dict['body_text']
                sentences = tokenize.punkt_sentences(text)
                normal = [normalize.xml_normalize(sentence) for sentence in sentences]
                tokens = [' '.join(tokenize.word_punct_tokens(sentence)) for sentence in normal]
                base_doc = doc.split('.')[0]
                output_filename = "{}_{}.csv".format(base_doc,post_id)
                # Creates one output file per line of input file. Output file includes post id in name:
                # {clusterid}_{postid}.csv
                rel_path = os.path.relpath(root,sample_location)
                output_path = os.path.join(parsed_data_location,rel_path,output_filename)
                os.makedirs(os.path.dirname(output_path), exist_ok = True)
                with open(output_path,'w') as token_file:
                    #print(parsed_data_location,rel_path,output_filename)
                    writer = csv.DictWriter(token_file,fieldnames)
                    writer.writeheader()
                    output_dict = temp_dict
                    for token in tokens:
                        output_dict['body_text'] = token
                        writer.writerow(output_dict)

## Reformat to match skip-thoughts code input

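If you followed the in-memory approach at the end of this notebook, `tokenized` is a list of lists. Each inner list represents a document as a list of strings, where each string represents a sentence. The streaming cells above store the same information on disk instead, as one `.csv` file per post with one sentence per row.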

### An annoying issue

The trainer expects a list of sentences. To match expectations, those inner brackets need to disappear.

However, this then looks like one really long document in which the documents have been smashed together in arbitrary order, and the training will mistake the first sentence of one document for part of the context of the last sentence of another. For sufficiently long documents, you can argue this is just noise. For documents that are themselves only a few sentences long, this seems like too much noise.

My kludgy fix is to introduce a sentence consisting of a single null character `'\0'` and add this sentence between every document when concatenating. This may have unintended side-effects.
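
As a minimal in-memory sketch of that fix, assuming the documents are already a list of lists of sentence strings: the function name `flatten_with_separator` and the demo data are made up. Like the streaming cell in this section, it puts a separator after every document, including the last one.

```python
from itertools import chain

def flatten_with_separator(documents, separator='\0'):
    # documents: list of documents, each a list of sentence strings.
    # Returns one flat list with a separator sentence after every document.
    return list(chain.from_iterable(doc + [separator] for doc in documents))

demo_docs = [['first post .', 'it has two sentences .'], ['second post .']]
flat = flatten_with_separator(demo_docs)
print(flat)
```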

As above, this notebook doesn't depend on much memory. The next cell does not assume you have `tokenized` stored and thus asks you to read it back in. I found this more convenient in the end.

The cell depends on previously defined variables `parsed_data_location` and `training_data_location` for input and output respectively.

In [15]:
doc_separator = '\0'

In [16]:
# This cell does three things
# Writes sentences to a text file one line per sentence, with the null character separating documents.
# Stores all sentences into a list
# Stores the cluster_ids into a numpy array. Each sentence gets the cluster_id of its post. So the list and numpy array
# are the same length.
sentences = []
cluster_ids = []
with open(training_data_location,'w') as outfile:
    for root, dirs, files in os.walk(parsed_data_location):
        for doc in files:
            if doc.endswith('.csv'):
                for line in csv.DictReader(open(os.path.join(root,doc))):
                    outfile.write(line['body_text'] + '\n')
                    sentences.append(line['body_text'])
                    cluster_ids.append(int(line['cluster_id']))
                outfile.write(doc_separator + '\n')
                # Keep the list aligned with the file and with cluster_ids
                sentences.append(doc_separator)
                cluster_ids.append(-1)
cluster_ids = numpy.array(cluster_ids)

## Build the skip-thoughts training dictionaries

These are pretty basic things about the whole corpus required by the skip-thoughts code.

`wordcount` is a dictionary of word counts, ordered by the order in which the words first appear in the sentences. `worddict` is a dictionary of the same words whose values are each word's rank by count, ordered by that rank.
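
To illustrate the two structures, here is a rough sketch of how such dictionaries could be built. It is not the code in `vocab.build_dictionary`. The function name is made up, and the starting rank of 2 is an assumption on my part, on the idea that the lowest indices are kept for special tokens.

```python
from collections import OrderedDict

def sketch_build_dictionary(sentences, first_index=2):
    # Count whitespace-separated tokens, keeping first-appearance order
    wordcount = OrderedDict()
    for sentence in sentences:
        for word in sentence.split():
            wordcount[word] = wordcount.get(word, 0) + 1
    # Rank words from most to least frequent (ties keep first-appearance order)
    ranked = sorted(wordcount, key=lambda w: -wordcount[w])
    # first_index=2 is an assumption, see the note above
    worddict = OrderedDict((w, i + first_index) for i, w in enumerate(ranked))
    return worddict, wordcount

demo_worddict, demo_wordcount = sketch_build_dictionary(['a b a', 'b a c'])
print(list(demo_wordcount.items()))
print(list(demo_worddict.items()))
```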

In [17]:
# Can skip this cell if sentences is still in memory
sentences = [x.strip() for x in open(training_data_location).readlines()]

In [18]:
len(sentences)

9580

In [19]:
# wordcount: the count of each word, ordered by first appearance in the text
# worddict: the same words mapped to their rank by count
worddict, wordcount = vocab.build_dictionary(sentences)

In [20]:
vocab.save_dictionary(worddict, wordcount, vocab_location)

## Training a model

#### First set parameters

Definitely set:
* saveto: a path where the model will be periodically saved
* dictionary: where the dictionary is.

Both these should have been previously set as `model_location` and `vocab_location` respectively.

Consider tuning:
* dim_word: the dimensionality of the RNN word embeddings (Default 620)
* dim: the size of the hidden state (Default 2400)
* max_epochs: the total number of training epochs (Default 5)

* decay_c: weight decay hyperparameter (Default 0, i.e. ignored)
* grad_clip: gradient clipping hyperparameter (Default 5)
* n_words: the size of the decoder vocabulary (Default 20000)
* maxlen_w: the max number of words per sentence. Sentences longer than this will be ignored (Default 30)
* batch_size: size of each training minibatch (roughly) (Default 64)
* saveFreq: save the model after this many weight updates (Default 1000)

Other options:
* displayFreq: display progress after this many weight updates (Default 1)
* reload_: whether to reload a previously saved model (Default False)

#### Some observations on parameters

The default displayFreq is 1, which means every iteration prints something. That seems excessive. I suggest 100.

As long as the computer can handle it in memory, a bigger batch size seems better all around. I am trying 256.

A good chunk of stackexchange sentences seemed to be at least 30 tokens. I am changing that setting to 40. 
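
One quick way to check a choice like this is to measure what share of sentences would exceed a given limit. The sketch below uses a made-up helper and demo data. Whether the trainer drops sentences at exactly `maxlen_w` words or only above it is something I have not checked, so treat the cutoff as approximate.

```python
def fraction_longer_than(sentences, max_words):
    # Share of sentences with more than max_words whitespace-separated tokens
    if not sentences:
        return 0.0
    too_long = sum(1 for s in sentences if len(s.split()) > max_words)
    return too_long / len(sentences)

demo_sentences = ['one two three', 'one two three four five', 'one']
for limit in (2, 4):
    print(limit, fraction_longer_than(demo_sentences, limit))
```

On the real data this would be called as `fraction_longer_than(sentences, 30)` and `fraction_longer_than(sentences, 40)`.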

A version of this section with some hardcoded paths can be found in `stackoverflow_train.py`.

In [None]:
# Using a small set of parameters for testing
params = dict(
    saveto = model_location,
    dictionary = vocab_location,
    n_words = 1000,
    dim_word = 100,
    dim = 500,
    max_epochs = 1,
    saveFreq = 100,
    )

In [None]:
train.trainer(sentences,**params)

## Encoding sentences

The model created doesn't quite fit into the pipeline, because it is a "uni-skip" model, not a "combine-skip" model. The pipeline uses skipthoughts.encode, which requires very particularly formatted models.

The model built above instead works with the encode function found in the tools module.

Except that this function often breaks.

I have not trained a "combine-skip model". The model here is the equivalent of `utable.npy`.

One would still need to train a `btable.npy` equivalent. A btable is created by training a model with half the dimension, reversing the sentences and training again, then concatenating the two models into btable. I have not done this and may be missing some subtlety.

An equally buggy version of this section is found in the file `encode_sentences.py`

In [22]:
# This cell requires hardcoded paths in tools.py to be changed. It should perhaps also be fixed to not depend
# on hardcoded paths.
embed_map = tools.load_googlenews_vectors(path_to_word2vec)
model = tools.load_model(embed_map)

Loading dictionary...
Creating inverted dictionary...
Loading model options...
Loading model parameters...
Compiling encoder...
Creating word lookup tables...
Packing up...


In [None]:
# Having a lot of trouble getting this line to not crash. It causes a "floating point exception".
tools.encode(model,sentences)

## How to evaluate?

Supervised task. Apply cluster_id as label to each sentence. Run regression. Evaluate performance.

Since there is so much stackexchange data, a random sample may be sufficient. So choose a percentage of the data to sample from. That sample will then get divided into training and testing.
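
A minimal sketch of that two-stage sampling, using only numpy, is given here. The function name and demo labels are made up. Dropping the separator rows (label -1) before splitting is my own assumption, and the cells in this section do not do it.

```python
import numpy

def sample_and_split(labels, evaluation_fraction, train_fraction, seed=0):
    # labels: one integer per line of the training file, -1 marking separators
    rng = numpy.random.default_rng(seed)
    n_sample = int(evaluation_fraction * len(labels))
    chosen = numpy.sort(rng.choice(len(labels), size=n_sample, replace=False))
    # Assumption: separator rows carry no real label, so leave them out
    chosen = chosen[labels[chosen] != -1]
    # Shuffle the remaining indices, then cut them into train and test
    shuffled = rng.permutation(chosen)
    n_train = int(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

demo_labels = numpy.array([3, 3, -1, 7, 7, 7, -1, 9, 9, -1])
train_idx, test_idx = sample_and_split(demo_labels, 1.0, 0.5)
print(len(train_idx), len(test_idx))
```

The returned index arrays can be used to pick rows from both the encodings and the cluster ids, which keeps the two aligned.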

In [23]:
evaluation_percent = 1 # Fraction (between 0 and 1) of the data to subsample
holdout_percent = 0.5 # Fraction of that subsample used as training data; the rest is testing data

# e.g. 1,000,000 sentences. evaluation_percent = 0.1, holdout_percent = 0.8 --
# Choose 100,000 sentences. Then choose 80,000 of those for training and 20,000 of those for testing.

In [24]:
# Read in the sentences if they are not already in memory.
sentences = [x.strip() for x in open(training_data_location).readlines()]

In [25]:
num_sentences = len(sentences)

In [26]:
# Read in cluster ids, again if not already in memory.
cluster_ids = []
for root, dirs, files in os.walk(parsed_data_location):
    for doc in files:
        if doc.endswith('.csv'):
            for line in csv.DictReader(open(os.path.join(root,doc))):
                cluster_ids.append(int(line['cluster_id']))
            cluster_ids.append(-1)
cluster_ids = numpy.array(cluster_ids)

In [27]:
# Sanity check. Should be true.
num_sentences == len(cluster_ids)

True

In [28]:
# Sample a percentage of your data specified above as evaluation_percent.
indices = numpy.arange(num_sentences)
num_samples = int(evaluation_percent * num_sentences)
index_sample = numpy.sort(numpy.random.choice(indices, size=num_samples, replace = False))
sample_sentences = [sentences[i] for i in index_sample]
sample_clusters = cluster_ids[index_sample]

In [29]:
# Broken!!!
# This section requires the encodings of the previous section. But...
#encodings = tools.encode(model, sample_sentences)
encodings = numpy.random.rand(num_samples,10) #Since I can't get encodings to actually work. Here are some numbers.

From this point forward, the code is not well-tested because I couldn't get the encode function to work.

In [30]:
# holdout_percent is the training share, as described where it is defined
encoding_train, encoding_test, cluster_train, cluster_test = train_test_split(encodings,
                                                                              sample_clusters,
                                                                              train_size = holdout_percent)

In [31]:
regression = LinearRegression()
regression.fit(encoding_train, cluster_train)
regression.predict(encoding_test)
regression.score(encoding_test, cluster_test)

-0.0014788214640906183

## The end.

This is the end of the notebook. Below is an alternative approach. Not as well tested.

## An in-memory approach.

Because everything kept crashing on me, I ultimately found it most convenient to do everything in a streaming fashion with a lot of writing to disk at every stage. This is obviously slower than desirable. Basically I do a thing, write out the results, read the results back in, then do the next thing.

Below is an in-memory approach that reads everything into memory and pushes forward, still sometimes saving key steps to disk, but without any rereading in. Because of various technical issues, this code has never been tested at scale. It works on the anime dataset.

In [34]:
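# Only read files ending in file_extension. Without this filter, any binary file under
# sample_location is read as text, a likely cause of the UnicodeDecodeError recorded below.
doc_dicts = [json.loads(line)
            for root,dirs,files in os.walk(sample_location)
            for doc in files
            if doc.endswith(file_extension)
            for line in open(os.path.join(root,doc))
            ]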
# doc_dicts is a list of dictionaries, each containing document data
# In the anime sample, the text is labeled 'body_text'
# There is a field cluster_id which we will use as the categorical label
cluster_ids = [d['cluster_id'] for d in doc_dicts]
docs = [d['body_text'] for d in doc_dicts]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 15: invalid start byte

In [None]:
del(doc_dicts) # For efficiency

In [None]:
# Make list of sentences for each doc
sentenced = [tokenize.punkt_sentences(doc) for doc in docs]

In [None]:
# Normalize each sentence
normalized = [[normalize.xml_normalize(sentence) for sentence in doc] for doc in sentenced]

In [None]:
del(sentenced) #If you're done with it

In [None]:
#Tokenize each sentence
tokenized = [[' '.join(tokenize.word_punct_tokens(sentence)) for sentence in doc] for doc in normalized]

In [None]:
separated = sum(zip(tokenized,[[doc_separator]]*len(tokenized)),tuple())
sentences = sum(separated,[])

This leaves you with the sentences object in memory, leaving you ready to build the skip-thoughts training dictionaries.
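
One gap in this in-memory path: `cluster_ids` as built above holds one entry per document, while the evaluation section expects one entry per sentence, with -1 at each separator. A sketch of the expansion is given here. The function name is made up, and like the rest of this section it is untested at scale.

```python
def expand_labels(tokenized_docs, doc_labels, separator_label=-1):
    # One label per sentence, plus a separator label after each document,
    # so the result lines up with the flattened sentence list
    sentence_labels = []
    for doc, label in zip(tokenized_docs, doc_labels):
        sentence_labels.extend([label] * len(doc))
        sentence_labels.append(separator_label)
    return sentence_labels

demo_tokenized = [['a b .', 'c d .'], ['e f .']]
demo_doc_labels = [5, 8]
print(expand_labels(demo_tokenized, demo_doc_labels))
```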