## Hard-coding data paths

This notebook should mostly run as is. The one requirement is that you fill in the paths in the cell below.

`vocab_location` should be the name of a pickle file (including path), which will store information about the words in the corpus.

`model_location` should be the name of a .npz (zipped numpy) file (including path), which will store the model itself as a numpy array.

`stackoverflow_sample_location` should be the path to a directory that contains your data. Each file should consist of JSON-parsable lines. The directory may have subdirectories; the code finds files recursively. The directory should contain no files other than those the code is meant to parse.
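
As a rough illustration of the directory layout described above, the sketch below builds a throwaway directory with one subdirectory and reads every JSON line back. The directory, the file names, the record contents, and the `load_json_lines` helper are all made up for the example; only the use of `os.walk` and `json.loads` mirrors what the notebook does.

```python
import json
import os
import tempfile


def load_json_lines(top):
    """Recursively parse every non-empty line of every file under `top` as JSON."""
    records = []
    for root, dirs, files in os.walk(top):
        for name in sorted(files):
            # `root` already starts with `top`, so join only root and file name
            with open(os.path.join(root, name)) as handle:
                records.extend(json.loads(line) for line in handle if line.strip())
    return records


# Hypothetical sample data: two files, one of them in a subdirectory
with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, 'sub'))
    with open(os.path.join(top, 'a.json'), 'w') as handle:
        handle.write(json.dumps({'body_text': 'first post', 'cluster_id': 0}) + '\n')
    with open(os.path.join(top, 'sub', 'b.json'), 'w') as handle:
        handle.write(json.dumps({'body_text': 'second post', 'cluster_id': 1}) + '\n')
    records = load_json_lines(top)

print(len(records))
```

Because every file under the directory is parsed, a stray file that is not JSON lines would make `json.loads` raise, which is why the directory should hold nothing but data.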

In [1]:
datafile = '/Users/chrisn/data/book_corpus/books_txt/Adventure/100290.txt'
vocab_location = '/Users/chrisn/mad-science/pythia/data/book_corpus/model/vocab.pickle'
model_location = '/Users/chrisn/mad-science/pythia/data/book_corpus/model/corpus.npz'
stackoverflow_sample_location = '/Users/chrisn/mad-science/pythia/data/stackexchange/anime'

In [21]:
# Import auxiliary modules
import os
import json
import numpy

In [3]:
# Import skipthoughts modules
import sys
import theano
import theano.tensor as tensor
sys.path.append('/Users/chrisn/mad-science/pythia/src/featurizers/')
from training import vocab, train, tools
import skipthoughts

In [4]:
# Import hyperopt modules
from hyperopt import hp, fmin, tpe

In [5]:
# Import pythia modules
sys.path.append('/Users/chrisn/mad-science/pythia/')
from src.utils import normalize, tokenize

In [6]:
# For evaluation purposes, import some sklearn modules
from sklearn.linear_model import LinearRegression

In [7]:
import warnings
warnings.filterwarnings('ignore')
# Because there were a lot of annoying warnings. Ignore this cell if you want to see them.

In [8]:
theano.config.floatX = 'float32'

In [20]:
doc_dicts = [json.loads(line)
            for root,dirs,files in os.walk(stackoverflow_sample_location)
            for doc in files
            # root already includes stackoverflow_sample_location, so do not join it again
            for line in open(os.path.join(root,doc))
            ]
# doc_dicts is a list of dictionaries, each containing document data
# In the anime sample, the text is labeled 'body_text'
# There is a field cluster_id which we will use as the categorical label
docs = [d['body_text'] for d in doc_dicts]
cluster_ids = [d['cluster_id'] for d in doc_dicts]

[0, 0, 0, 0, 1, 1, 1, 10, 10, 100, 100, 101, 102, 102, 103, 103, 104, 104,
 105, 106, 106, 106, 106, 107, 107, 108, 108, 109, 11, 11, 11, 11, 110, 110,
 111, 111, 112, 112, 112, 112, 113, 113, 114, 114, 115, 115, 115, 116, 116,
 117, 117, 117, 117, 118, 118, 119, 119, 119, 119, 119, 119, 119, 119, 12, 12,
 120, 120, 121, 121, 122,
 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123,
 124, 124, 125, 126, 126, 127, 127, 127, 128, 129, 129, 13, 13, 130, 130,
 131, 131, 132, 132, 133, 133, 134, 134,
 135, 135, 135, 135, 135, 135, 135, 135,
 136, 136, 137, 137, 138, 138, 139, 139, 14, 14, 14, 14,
 140, 140, 140, 140, 140, 140, 140, 140,
 141, 141, 142, 143, 143, 143, 144, 145, 145, 146, 146, 147, 147, 148, 148,
 149, 149, 15, 15, 15, 15, 150, 150,
 151, 151, 151, 151, 151, 151, 151, 151, 151, 151,
 152, 152,


## Tokenization and normalization

There is no single agreed-upon way to do this. I tried to match the expectations of both the skip-thoughts code and the pythia codebase as best I could.

For each document:

1) Make a list of sentences. We use utils.tokenize.punkt_sentences.

2) Normalize each sentence: remove HTML and make everything lower-case. We use utils.normalize.xml_normalize.

3) Tokenize each sentence. We use utils.tokenize.word_punct_tokens and rejoin the tokens, so each sentence becomes a string of space-separated tokens.
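
The three steps above can be sketched on a single toy document as follows. The functions here are simple regex stand-ins written for this example, not the pythia `utils` functions, whose actual behavior may differ (for instance, a Punkt sentence splitter handles abbreviations that this naive split does not).

```python
import re


def split_sentences(text):
    # Stand-in sentence splitter: break after ., ? or ! followed by whitespace
    return [s for s in re.split(r'(?<=[.?!])\s+', text.strip()) if s]


def strip_markup_and_lowercase(sentence):
    # Stand-in normalizer: drop anything that looks like a tag, then lower-case
    return re.sub(r'<[^>]+>', ' ', sentence).lower().strip()


def word_punct_tokens(sentence):
    # Runs of word characters and runs of punctuation become separate tokens
    return re.findall(r'\w+|[^\w\s]+', sentence)


doc = '<p>Why do alchemists need a circle? Size seems to matter...</p>'
sentenced = split_sentences(doc)
normalized = [strip_markup_and_lowercase(s) for s in sentenced]
tokenized = [' '.join(word_punct_tokens(s)) for s in normalized]
print(tokenized)
```

Rejoining the tokens with spaces is what lets a downstream consumer recover them with a plain `str.split()`.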




In [10]:
# Make list of sentences for each doc
sentenced = [tokenize.punkt_sentences(doc) for doc in docs]
# Normalize each sentence
normalized = [[normalize.xml_normalize(sentence) for sentence in doc] for doc in sentenced]
# Tokenize each sentence and rejoin the tokens with spaces
tokenized = [[' '.join(tokenize.word_punct_tokens(sentence)) for sentence in doc] for doc in normalized]

## Category labels

Each document has a cluster_id. We use this cluster_id as the categorical label for each sentence in the document.

We create a numpy array of shape num_sentences by (num_clusters + 1), with one row per sentence and one column per label. The extra column is for the null sentences we will mention shortly.

In [22]:
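target = []
for doc_index, cluster_id in enumerate(cluster_ids):
    num_sentences = len(tokenized[doc_index]) # The number of sentences in the current document
    # Each sentence inherits its document's cluster_id; None marks the null
    # sentence that is appended after each document (see the next section)
    target.extend([cluster_id] * num_sentences + [None])

In [None]:
# One-hot encode: one column per cluster_id plus a final column for the null label
labels = sorted(set(cluster_ids)) + [None]
target = numpy.array([[int(t == label) for label in labels] for t in target])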

## An annoying issue

`tokenized` is now a list of lists. Each inner list represents a document as a list of strings, where each string represents a sentence.

The trainer expects a list of sentences. To match expectations, those inner brackets need to disappear.

However, the result then looks like one really long document in which the original documents have been smashed together in arbitrary order, and the training will mistake the first sentence of one document for part of the context of the last sentence of another. For sufficiently long documents, you can argue this is just noise. For documents that are themselves only a few sentences long, this seems like too much noise.

My kludgy fix is to introduce a sentence consisting of a single null character `'\0'` and add this sentence after every document when concatenating. This may have unintended side effects.
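
A minimal sketch of the idea on toy data, written with `itertools.chain` rather than the notebook's own code. The names `toy_docs` and `toy_cluster_ids` and their contents are invented for the example, and using `None` as the label of a separator sentence is one possible choice, not something the trainer requires. Building the labels in the same pass keeps them aligned with the flattened sentences.

```python
from itertools import chain

doc_separator = '\0'
# Toy stand-in for `tokenized`: two documents, already split into sentences
toy_docs = [['a b .', 'c d .'], ['e f .']]
toy_cluster_ids = [7, 3]

# Flatten the documents, appending the separator sentence after each one
sentences = list(chain.from_iterable(doc + [doc_separator] for doc in toy_docs))

# Build labels in the same order; None marks the separator sentences
labels = list(chain.from_iterable(
    [cluster_id] * len(doc) + [None]
    for doc, cluster_id in zip(toy_docs, toy_cluster_ids)))

print(len(sentences), len(labels))
```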

In [11]:
# Combine all documents into one list of sentences, with a separator sentence after each document

doc_separator = '\0'
separated = sum(zip(tokenized,[[doc_separator]]*len(tokenized)),tuple())
sentences = sum(separated,[])

In [12]:
for x in sentences[:100]:
    print(x,'\n')

why does an alchemist need a transmutation circle ? 

as a follow up , but another subject , of my question about equivalent exchange part of the alchemy laws . 

why do ( most ) alchemists require a transmutation circle ? 

does any circle suffice or does an alchemist require a specific type for each ( type of ) job ? 

at least the size seems to matter ... 

  

how does the equivalent exchange of alchemy work ? 

alchemy , in full metal alchemist , is based on the concept of ' one needs to provide materials of equal value compared to the thing one want to create ' ( equivalent exchange ). 

but , how does it exactly work . 

can this amount of required materials be calculated by the alchemist ? 

if so , how ? 

are there some sort of lookup - tables ? 

or does one need to guess and provides something of more value to be on the safe side ? 

if so , can one become a more skilled alchemist by experience of successful guesses ? 

also , different alchemists might have different speci

In [13]:
for x in sentences[::100]:
    print(x,'\n')

why does an alchemist need a transmutation circle ? 

i ' m wondering if there are any official site for anime , maybe something like imdb ? 

they seem to have long time conversation among themselves isn ' t it ? 

is it then possible to say , that any humans that use the death note do not go to heaven nor hell because they go to the shinigami realm ? 

i really like the storyline , and i kinda miss it since the series hasn ' t released a new chapter in a long time . 

and can we possibly determine the duration of otonashi ' s stay in the afterlife world up till the last episode ? 

  

what happened in the final episode of the evangelion tv series , titled " take care of yourself "? 

during the substitute shinigami arc , it is covered that both ichigo and ginjo are substitute shinigami ( ginjo being and ex - substitute shinigami ). 

zeref tells natsu to choose between life and death . 

then people with pk caused a war and for that reason , human population suffered a terrible decr

In [14]:
# wordcount: the count of each word, ordered by first appearance in the text
# worddict: the integer id of each word, assigned by descending frequency
worddict, wordcount = vocab.build_dictionary(sentences)
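
For intuition, here is a small pure-Python sketch of the kind of bookkeeping a vocabulary builder does: count each whitespace-separated token, then assign integer ids by descending frequency. It is not the skip-thoughts implementation. The function name `toy_build_dictionary` and the choice to start ids at 2 (leaving 0 and 1 free for end-of-sentence and unknown-word markers) are assumptions made for the example.

```python
from collections import OrderedDict


def toy_build_dictionary(sentences):
    # Count tokens in order of first appearance
    wordcount = OrderedDict()
    for sentence in sentences:
        for word in sentence.split():
            wordcount[word] = wordcount.get(word, 0) + 1
    # Most frequent word gets the smallest id; ties keep first-appearance order
    ranked = sorted(wordcount, key=lambda w: -wordcount[w])
    worddict = OrderedDict((word, idx + 2) for idx, word in enumerate(ranked))
    return worddict, wordcount


toy_worddict, toy_wordcount = toy_build_dictionary(['a b a .', 'b a .'])
print(dict(toy_wordcount))
print(dict(toy_worddict))
```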

In [15]:
vocab.save_dictionary(worddict, wordcount, vocab_location)

In [16]:
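params = dict(
    saveto = model_location, # where the trained model is written
    dictionary = vocab_location, # the vocabulary pickle saved above
    n_words = 1000, # vocabulary size
    dim_word = 100, # word vector dimensionality
    dim = 500, # number of recurrent units
    max_epochs = 1,
    saveFreq = 100, # save the model every saveFreq updates
    )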

In [None]:
train.trainer(sentences,**params)

In [None]:
model = tools.load_model()

In [None]:
tools.encode(model,sentences)

In [None]:
skipthoughts.encode(model,sentences)

## How to evaluate?

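Treat this as a supervised task: apply each document's cluster_id as the label for each of its sentences, run a regression from the sentence encodings to those labels, and evaluate how well it predicts them.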

In [None]:
regression = LinearRegression()

In [None]:
regression.fit()
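
The evaluation idea above might look roughly like the following sketch. Random vectors stand in for the sentence encodings, since no trained model is assumed here; the data sizes, the train/held-out split, and the argmax-based accuracy score are all choices made for illustration rather than the notebook's settled method.

```python
import numpy
from sklearn.linear_model import LinearRegression

rng = numpy.random.RandomState(0)
num_classes, dim = 3, 8

# Random stand-ins: each class gets a centre, sentences are noisy copies of it
labels = rng.randint(0, num_classes, size=60)
centres = 5 * rng.randn(num_classes, dim)
features = centres[labels] + rng.randn(60, dim)
target = numpy.eye(num_classes)[labels]  # one-hot rows, one column per class

# Fit on the first 40 rows, hold out the last 20
regression = LinearRegression()
regression.fit(features[:40], target[:40])
predicted = regression.predict(features[40:])

# Read the largest output as the predicted class and score it
accuracy = numpy.mean(predicted.argmax(axis=1) == labels[40:])
print(accuracy)
```

With real encodings, `features` would be the encoder output for `sentences` and `target` the one-hot label array, with the null separator rows either dropped or kept as their own class.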