# Pipeline for classifying texts

The classifier expects input text files in which every line has the form:  
`sentence id[tab]sentence[tab]None`  
with one such line per sentence.

The sentences should be tokenized, and tokens should be separated by a space.
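As a minimal sketch of the format described above, the snippet below builds two input lines. The helper name `format_line`, the sentence ids, and the example sentences are made up for illustration; only the column layout (id, space-separated tokens, `None`) is taken from the description.

```python
def format_line(sentence_id, tokens, label='None'):
    """Return one tab-separated line: sentence id, space-joined tokens, label."""
    return '{}\t{}\t{}'.format(sentence_id, ' '.join(tokens), label)


# Hypothetical, already tokenized sentences.
lines = [
    format_line('s1', ['Dit', 'is', 'een', 'zin', '.']),
    format_line('s2', ['En', 'nog', 'een', '.']),
]
print('\n'.join(lines))
```

Writing `lines` to a `.txt` file, one line each, would give a file in the layout the classifier cell reads.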

It is best to have a single file for each text for which labels should be predicted.

The text files should be put together in a single directory.

Use notebook [01_CreateDataForPrediction](01_CreateDataForPrediction.ipynb) to generate data in the correct format.

In [2]:
# Annotation

# path to the input data
data_dir = '/home/jvdzwaan/data/embem/txt/annotation-for_prediction-normalized/'

# specify the path where output should be written
out_dir = '/home/jvdzwaan/data/embem/txt/annotation-predicted-heem-normalized/'

In [2]:
# Corpus big

# path to the input data
data_dir = '/home/jvdzwaan/data/embem/txt/corpus_big-for_prediction-normalized/'

# specify the path where output should be written
out_dir = '/home/jvdzwaan/data/embem/txt/corpus_big-predicted-heem-normalized/'

In [5]:
# Ceneton data

# path to the input data
data_dir = '/home/jvdzwaan/data/embem/txt/ceneton-for_prediction-normalized/'

# specify the path where output should be written
out_dir = '/home/jvdzwaan/data/embem/txt/ceneton-predicted-heem-normalized/'

In [8]:
# EDBO data

# path to the input data
data_dir = '/home/jvdzwaan/data/embem/txt/edbo-for_prediction-normalized/'

# specify the path where output should be written
out_dir = '/home/jvdzwaan/data/embem/txt/edbo-predicted-heem-normalized/'

In [3]:
import os

if not os.path.exists(out_dir):
    os.makedirs(out_dir)

# classifier file
classifier = '/home/jvdzwaan/data/classifier/classifier.pkl'

# train file
train_file = '/home/jvdzwaan/data/embem_ml/multilabel-normalized/all.txt'

In [5]:
try:
    # sklearn.externals.joblib was removed in scikit-learn 0.23
    import joblib
except ImportError:
    from sklearn.externals import joblib
import codecs
from utils import get_data, load_data

# load classifier
clf = joblib.load(classifier)

text_files = [fi for fi in os.listdir(data_dir) if fi.endswith('.txt')]
for i, text_file in enumerate(text_files):
    in_file = os.path.join(data_dir, text_file)
    print('({} of {}) {}'.format(i+1, len(text_files), text_file))

    # load data
    X_train, X_data, Y_train, Y_data, classes_ = get_data(train_file, in_file)

    # classify
    pred = clf.predict(X_data)

    # save results
    out_file = os.path.join(out_dir, text_file)

    X_data_with_ids, Y_data = load_data(in_file)

    with codecs.open(out_file, 'wb', 'utf8') as f:
        for x, y in zip(X_data_with_ids, pred):
            f.write(u'{}\t{}\n'.format(x.decode('utf8'),
                                       '_'.join(classes_[y]) or 'None'))

(1 of 29) vond001gysb04.txt
(2 of 29) ross006zing01.txt
(3 of 29) huyd001achi01.txt
(4 of 29) hoof001gran01.txt
(5 of 29) stee033adag01.txt
(6 of 29) rivi001jeug01.txt
(7 of 29) fres003pefr01.txt
(8 of 29) bidl001nede01.txt
(9 of 29) hare003agon01.txt
(10 of 29) alew001puit01.txt
(11 of 29) hoof001achi01.txt
(12 of 29) lijn002vlug01.txt
(13 of 29) bred001moor01.txt
(14 of 29) stee033tham01.txt
(15 of 29) alew001besl01.txt
(16 of 29) vinc001pefr02.txt
(17 of 29) bren001scha01.txt
(18 of 29) lang020chph01.txt
(19 of 29) bren001goud01.txt
(20 of 29) meij001verl01.txt
(21 of 29) vond001jose05.txt
(22 of 29) vond001pala01.txt
(23 of 29) rivi001vero01.txt
(24 of 29) pels001verw02.txt
(25 of 29) focq001mini02.txt
(26 of 29) noms001mich01.txt
(27 of 29) weye002holl01.txt
(28 of 29) vos_002kluc01.txt
(29 of 29) ling001ontd01.txt
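The expression `'_'.join(classes_[y]) or 'None'` in the cell above turns one prediction row into the label column of the output file. The sketch below illustrates that step in isolation. It assumes that `classes_` is a NumPy array of label names and that each prediction row is a boolean mask; the label names and the helper `labels_to_string` are hypothetical and not part of the pipeline.

```python
import numpy as np

# Hypothetical label names; in the notebook they are returned by get_data().
classes_ = np.array(['Angst', 'Blijdschap', 'Liefde'])


def labels_to_string(mask, classes):
    """Join the selected class names with '_'; fall back to 'None' if none are selected."""
    return '_'.join(classes[mask]) or 'None'


# Two hypothetical prediction rows: one with two labels, one with none.
pred = np.array([[True, False, True],
                 [False, False, False]])
for row in pred:
    print(labels_to_string(row, classes_))
```

If the classifier returns 0/1 integers instead of booleans, the row would need `row.astype(bool)` first, because integer arrays index by position rather than acting as a mask.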


In [6]:
# make unnormalized version of predicted labels (needed before expanding body part labels)

%run merge_data_and_labels.py /home/jvdzwaan/data/embem/txt/annotation-predicted-heem-normalized/ /home/jvdzwaan/data/embem/txt/annotation-for_prediction/ /home/jvdzwaan/data/embem/txt/annotation-predicted-heem
#%run merge_data_and_labels.py /home/jvdzwaan/data/embem/txt/corpus_big-predicted-heem-normalized/ /home/jvdzwaan/data/embem/txt/corpus_big-for_prediction/ /home/jvdzwaan/data/embem/txt/corpus_big-predicted-heem
#%run merge_data_and_labels.py /home/jvdzwaan/data/embem/txt/ceneton-predicted-heem-normalized/ /home/jvdzwaan/data/embem/txt/ceneton-for_prediction/ /home/jvdzwaan/data/embem/txt/ceneton-predicted-heem
#%run merge_data_and_labels.py /home/jvdzwaan/data/embem/txt/edbo-predicted-heem-normalized/ /home/jvdzwaan/data/embem/txt/edbo-for_prediction/ /home/jvdzwaan/data/embem/txt/edbo-predicted-heem

(1 of 29) vond001gysb04.txt
(2 of 29) ross006zing01.txt
(3 of 29) huyd001achi01.txt
(4 of 29) hoof001gran01.txt
(5 of 29) stee033adag01.txt
(6 of 29) rivi001jeug01.txt
(7 of 29) fres003pefr01.txt
(8 of 29) bidl001nede01.txt
(9 of 29) hare003agon01.txt
(10 of 29) alew001puit01.txt
(11 of 29) hoof001achi01.txt
(12 of 29) lijn002vlug01.txt
(13 of 29) bred001moor01.txt
(14 of 29) stee033tham01.txt
(15 of 29) alew001besl01.txt
(16 of 29) vinc001pefr02.txt
(17 of 29) bren001scha01.txt
(18 of 29) lang020chph01.txt
(19 of 29) bren001goud01.txt
(20 of 29) meij001verl01.txt
(21 of 29) vond001jose05.txt
(22 of 29) vond001pala01.txt
(23 of 29) rivi001vero01.txt
(24 of 29) pels001verw02.txt
(25 of 29) focq001mini02.txt
(26 of 29) noms001mich01.txt
(27 of 29) weye002holl01.txt
(28 of 29) vos_002kluc01.txt
(29 of 29) ling001ontd01.txt
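The internals of `merge_data_and_labels.py` are not shown in this notebook. As a rough illustration of the idea only, the sketch below carries predicted labels over to the unnormalized sentences. It assumes both files list the same sentences in the same order; the function name `merge_labels` and the pairing by line position are assumptions, not the script's actual behaviour.

```python
def merge_labels(predicted_lines, original_lines):
    """Pair each unnormalized sentence with the label predicted for its normalized version."""
    merged = []
    for pred_line, orig_line in zip(predicted_lines, original_lines):
        # The label is the last tab-separated column of the predicted file.
        label = pred_line.rstrip('\n').split('\t')[-1]
        # Sentence id and text come from the unnormalized file.
        sent_id, sentence = orig_line.rstrip('\n').split('\t')[:2]
        merged.append('{}\t{}\t{}'.format(sent_id, sentence, label))
    return merged


# Hypothetical example lines.
predicted = ['s1\tik heb u lief\tLiefde\n']
original = ['s1\tick heb u lief\tNone\n']
print(merge_labels(predicted, original))
```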


In [7]:
# Expand body parts

%run classify_body_parts.py /home/jvdzwaan/data/embem/dict/body_part_mapping.json /home/jvdzwaan/data/embem/txt/annotation-predicted-heem/ /home/jvdzwaan/data/embem/txt/annotation-predicted-heem-expanded_body_parts  /home/jvdzwaan/data/embem/dict/annotation_heem_expanded_body_parts.csv
#%run classify_body_parts.py /home/jvdzwaan/data/embem/dict/body_part_mapping.json /home/jvdzwaan/data/embem/txt/corpus_big-predicted-heem/ /home/jvdzwaan/data/embem/txt/corpus_big-predicted-heem-expanded_body_parts  /home/jvdzwaan/data/embem/dict/corpus_big_heem_expanded_body_parts.csv
#%run classify_body_parts.py /home/jvdzwaan/data/embem/dict/body_part_mapping.json /home/jvdzwaan/data/embem/txt/ceneton-predicted-heem/ /home/jvdzwaan/data/embem/txt/ceneton-predicted-heem-expanded_body_parts  /home/jvdzwaan/data/embem/dict/ceneton_heem_expanded_body_parts.csv
#%run classify_body_parts.py /home/jvdzwaan/data/embem/dict/body_part_mapping.json /home/jvdzwaan/data/embem/txt/edbo-predicted-heem/ /home/jvdzwaan/data/embem/txt/edbo-predicted-heem-expanded_body_parts  /home/jvdzwaan/data/embem/dict/edbo_heem_expanded_body_parts.csv

ignored: rose-kaken (cheeks)




The next step is to look at the results!

_To do: pipeline for showing/visualizing results_