# Data Step 1: Processing Feature Files for Bookworm

This notebook runs through Extracted Features files, saving:

1. Global token counts (by language) toward the eventual Bookworm Wordlist. 
    These aren't all folded here: rather, they are folded by batch and saved to an HDF5 store.
    Later, they'll all be folded into one big list.

2. "Raw" unigram counts per book. These will eventually be trimmed to only the BW vocabulary and
    labelled by an id. This information first needs the wordlist that #1 above will create, but
    since we're already opening the EF files, might as well do some processing and save this
    intermediate state to a fast IO format (HDF5 store, again).
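
The final fold happens outside this notebook. As a rough sketch of what it amounts to, assuming each batch is a DataFrame of counts indexed by `(language, token)` like the `tf/corpus` table saved further down, the batches can be concatenated and summed per key. The `fold_counts` and `make_batch` names and the toy data are made up for illustration.

```python
import pandas as pd

def fold_counts(batches):
    """Sum per-batch counts into one table keyed by (language, token)."""
    combined = pd.concat(batches)
    return combined.groupby(level=['language', 'token'])[['count']].sum().sort_index()

def make_batch(rows):
    """Build a toy batch from (language, token, count) tuples."""
    index = pd.MultiIndex.from_tuples([(lang, tok) for lang, tok, _ in rows],
                                      names=['language', 'token'])
    return pd.DataFrame({'count': [c for _, _, c in rows]}, index=index)

batch_a = make_batch([('eng', 'the', 10), ('eng', 'cat', 2)])
batch_b = make_batch([('eng', 'the', 7), ('fre', 'le', 4)])

folded = fold_counts([batch_a, batch_b])
print(folded)
```

With the real stores the concatenated table would likely be too large to hold in memory at once, so the fold would have to proceed incrementally; the sketch only shows the arithmetic.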

In [None]:
from htrc_features import FeatureReader, utils
import pandas as pd
from tqdm import tqdm_notebook # Progress bars!
from ipyparallel import Client
import numpy as np
import logging

The ipyparallel engines need to be started before this notebook can attach to them:

```bash
    ipcluster start -n NUM
```

In [None]:
rc = Client()
dview = rc[:]
v = rc.load_balanced_view()

Initialize logging. There's no nice way to pass logs between engines, so just give each one its own log.

The timestamp format is designed to sort easily, so you can follow all the logs at once with

```bash
watch "tail -q -n 100 logs/* | sort"
```
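
A quick check of why this format sorts well: month, day, and time are zero-padded and ordered from most to least significant, so plain string sorting matches chronological order. The epoch values below are arbitrary samples. The format has no year, so the ordering would break across a New Year boundary.

```python
import time

fmt = "%m/%d-%H:%M:%S"  # same date format passed to logging.Formatter below

# Three arbitrary moments, in chronological order (seconds since the epoch, UTC)
moments = (0, 3600, 86400 * 40)
stamps = [time.strftime(fmt, time.gmtime(t)) for t in moments]

# Lexicographic order of the strings equals chronological order
print(stamps)
print(sorted(stamps) == stamps)
```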

In [None]:
def init_log(name=False):
    import logging, os
    if not name:
        name = os.getpid()
    handler = logging.FileHandler("/notebooks/data/logs/bw-%s.log" % name, 'a')
    formatter = logging.Formatter('%(asctime)s:%(levelname)s:%(message)s', "%m/%d-%H:%M:%S")
    handler.setFormatter(formatter)
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logging.info("Log initialized")

dview.push(dict(init_log=init_log))
%px init_log()
init_log("root")

Load paths to feature files. This notebook maintains a list of successfully processed ids, so there are some functions that help us cross-reference all volumes against the volumes that are already done.
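
The cross-referencing itself is a set difference. A minimal sketch with made-up paths (the real ones come from the file listing and the success log): `np.setdiff1d` returns the sorted, unique entries of the first array that are absent from the second.

```python
import numpy as np

# Hypothetical paths, for illustration only
all_paths = np.array(["/features/c.json.bz2",
                      "/features/a.json.bz2",
                      "/features/b.json.bz2"])
done_paths = np.array(["/features/b.json.bz2"])

# Everything in all_paths that is not already done
remaining = np.setdiff1d(all_paths, done_paths)
print(remaining)
```

Because the result is sorted and de-duplicated, rerunning the job after a crash picks up only the volumes that were never logged as successful.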

In [None]:
with open("/notebooks/features/listing/pd-file-listing.txt", "r") as f:
    paths = ["/notebooks/features/"+path.strip() for path in f.readlines()][1:]
    print("Number of texts", len(paths))

def get_processed():
    ''' Get already processed files. Wrapped in func for easy refresh'''
    try:
        # Read the same file that the processing loop appends to
        with open("/notebooks/data/successful-counts.txt", "r") as f:
            paths = f.read().strip().split("\n")
        paths = ["/notebooks/features/"+utils.id_to_rsync(path) for path in paths]
        return np.array(paths)
    except Exception:
        # No readable success file yet: treat every volume as unprocessed
        return np.array([])

path_to_id = lambda x: x.replace(".json.bz2", "").split("/")[-1]

Number of texts 4805430


`get_count` does the processing for a single volume. To improve performance, however, the engines process many volumes at a time, in batches, with `get_doc_counts`.

In [None]:
def trim_token(t, max=50):
    ''' Trim unicode string to max number of bytes'''
    # Drop one character at a time, so a multi-byte character is never split
    while len(t.encode('utf-8')) > max:
        t = t[:-1]
    return t

def get_count(path, store=False):
    ''' Get tokencount information from a single doc, by path'''
    from htrc_features import FeatureReader    
    max_char = 50
    vol = FeatureReader(path).first()
    tl = vol.tokenlist(pages=False, pos=False)
    if tl.empty:
        return tl
    else:
        tl = tl.reset_index('section')[['count']]
    tl.index = [trim_token(t, max_char) for t in tl.index.values]
    tl.index.names=['token']
    tl['id'] = vol.id
    tl['language'] = vol.language
    tl = tl.reset_index('token').set_index(['language', 'id', 'token']).sort_index()
    return tl

# Send to Engines
dview.push(dict(trim_token=trim_token, get_count=get_count))

# Example
get_count(paths[0]).head(3)

| language | id | token | count |
|---|---|---|---|
| eng | ufl2.uf00100909_00001 | `!` | 1 |
| eng | ufl2.uf00100909_00001 | `"` | 397 |
| eng | ufl2.uf00100909_00001 | `""2` | 1 |


In [None]:
def get_doc_counts(paths, mincount=2, max_str_bytes = 50):
    '''
    This method lets you process multiple paths at a time on a single engine.
    This means the engine can collect enough texts to do a simple filter (i.e. >X counts in Y texts)
    and can save to its own store.
    '''
    import logging
    import os
    import pandas as pd
    fname = '/notebooks/data/stores/bw_counts_%s.h5' % os.getpid()
    success_log = []
    logging.info("Starting %d volume batch on PID=%s" % (len(paths), os.getpid()))
    with pd.HDFStore(fname, mode="a", complevel=9, complib='blosc') as store:
        tl_collector = []
        for path in paths:
            try:
                tl = get_count(path, store=store)
                if tl.empty:
                    continue
                tl_collector.append(tl)
            except:
                logging.exception("Unable to get count for path %s" % path)
                continue
            success_log.append(path)

        # Save a DF combining all the counts from this batch
        try:
            logging.info("Merging and Saving texts for %d paths starting with %s" % (len(paths), paths[0]))
            combineddf = pd.concat(tl_collector)
            
            # Save tf(doc) with volid but no lang
            # For efficient HDF5 storage, enforcing a 50 byte token limit. Can't use
            # Series.str.slice(stop=50) though, because really we care about bytes and
            # some unicode chars take multiple bytes in UTF-8.
            # volids are capped at 25 chars (the longest PD vol id)
            store.append('tf/docs',
                         combineddf.reset_index('language')[['count']],
                         min_itemsize = {'id': 25, 'token':max_str_bytes})
            
            ### Save tf(corpus)
            df = combineddf.groupby(level=['language', 'token'])[['count']]\
                           .sum().sort_index()
            # Filtering this way (by corpus total, not language total) is too slow:
            #if mincount:
            #    df = df.groupby(level='token')[['count']].filter(lambda x: x.sum()>=mincount)
            # Because we can't feasibly filter on total count and have to do so by lang x token, it
            # might unfairly punish sparse languages. My workaround is to trim only English by
            # mincount: any bias this would have would be in the bottom of the wordlist anyway.
            if mincount:
                # Keep every non-English row; keep English rows seen at least mincount times
                df = df[(df.index.get_level_values(0) != 'eng') | (df['count'] >= mincount)]
            store.append('tf/corpus', df, min_itemsize = {'token': max_str_bytes})
            tl_collector = dict()
            return success_log
        except:
            logging.exception("Saving error for %d paths starting with %s" % (len(paths), paths[0]))
            return []
    return paths
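
The byte limit discussed in the comments of `get_doc_counts` is about UTF-8 bytes, not characters. `trim_token` above handles this by dropping one character at a time. A different way to get the same guarantee, shown here only as a sketch, is to cut the encoded bytes and let the decoder discard any partial character at the cut. The `trim_to_bytes` name is hypothetical and is not used elsewhere in this notebook.

```python
def trim_to_bytes(token, max_bytes=50):
    """Cut a string so its UTF-8 encoding fits in max_bytes without splitting a character."""
    encoded = token.encode('utf-8')
    if len(encoded) <= max_bytes:
        return token
    # errors='ignore' drops the incomplete character left at the cut point, if any
    return encoded[:max_bytes].decode('utf-8', errors='ignore')

word = '\u00e9' * 30                 # 30 characters, but 60 bytes in UTF-8
trimmed = trim_to_bytes(word)        # cut to at most 50 bytes
short = trim_to_bytes('\u00e9' * 3, max_bytes=5)  # cut lands mid-character

print(len(word), len(word.encode('utf-8')))
print(len(trimmed), len(trimmed.encode('utf-8')))
print(len(short))
```

Slicing the string itself with `[:50]` would have kept all 30 characters here and still produced 60 bytes, which is why a character-based slice is not enough for `min_itemsize`.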

In [None]:
import time
# Split paths into N-sized chunks, so engines can iterate on multiple texts at once
chunk_size = 400
i = 0
# To avoid human error resulting in duplicated or missed files, simply trim the 
# Path list when this cell is run
remaining_paths = np.setdiff1d(paths, get_processed())
chunked_paths = [remaining_paths[i:i+chunk_size] for i in range(0, len(remaining_paths), chunk_size)]
n = 20
start = 0

starttime = time.time()
logging.info("Starting parallel job")
parallel_job = v.map(get_doc_counts, chunked_paths[start:start+n], ordered=False)

for result in tqdm_notebook(parallel_job, smoothing=0):
    i += 1
    if result:
        with open("/notebooks/data/successful-counts.txt", "a+") as f:
            ids = [path_to_id(path) for path in result]
            f.write("\n".join(ids)+"\n")
        logging.info("Done processing batch %d, from %s to %s" % (i, result[0], result[-1]))
    else:
        logging.error("Problem with result in batch %d" % i)

logging.info("Done")
logging.info(time.time()-starttime)

INFO:root:Starting parallel job
INFO:root:Done processing batch 1, from /notebooks/features/aeu/pairtree_root/ar/k+/=1/39/60/=t/0n/s1/nr/92/ark+=13960=t0ns1nr92/aeu.ark+=13960=t0ns1nr92.json.bz2 to /notebooks/features/aeu/pairtree_root/ar/k+/=1/39/60/=t/0q/r5/5f/0p/ark+=13960=t0qr55f0p/aeu.ark+=13960=t0qr55f0p.json.bz2
INFO:root:Done processing batch 2, from /notebooks/features/aeu/pairtree_root/ar/k+/=1/39/60/=t/0r/r2/vq/0q/ark+=13960=t0rr2vq0q/aeu.ark+=13960=t0rr2vq0q.json.bz2 to /notebooks/features/aeu/pairtree_root/ar/k+/=1/39/60/=t/0t/q6/g5/0s/ark+=13960=t0tq6g50s/aeu.ark+=13960=t0tq6g50s.json.bz2
INFO:root:Done processing batch 3, from /notebooks/features/aeu/pairtree_root/ar/k+/=1/39/60/=t/0g/t6/pr/0t/ark+=13960=t0gt6pr0t/aeu.ark+=13960=t0gt6pr0t.json.bz2 to /notebooks/features/aeu/pairtree_root/ar/k+/=1/39/60/=t/0j/t0/84/42/ark+=13960=t0jt08442/aeu.ark+=13960=t0jt08442.json.bz2
INFO:root:Done processing batch 4, from /notebooks/features/aeu/pairtree_root/ar/k+/=1/39/60/=t/0c/v5




In [None]:
import glob
stores = glob.glob("/notebooks/data/stores/*h5")
for storepath in stores:
    with pd.HDFStore(storepath, mode='r', complevel=9, complib='blosc') as store:
        print(store)

<class 'pandas.io.pytables.HDFStore'>
File path: /notebooks/data/stores/bw_counts_75.h5
/tf/corpus            frame_table  (typ->appendable_multi,nrows->880777,ncols->3,indexers->[index],dc->[token,language])
/tf/docs              frame_table  (typ->appendable_multi,nrows->3295309,ncols->3,indexers->[index],dc->[token,id])     
<class 'pandas.io.pytables.HDFStore'>
File path: /notebooks/data/stores/bw_counts_81.h5
/tf/corpus            frame_table  (typ->appendable_multi,nrows->560487,ncols->3,indexers->[index],dc->[token,language])
/tf/docs              frame_table  (typ->appendable_multi,nrows->4016176,ncols->3,indexers->[index],dc->[token,id])     
<class 'pandas.io.pytables.HDFStore'>
File path: /notebooks/data/stores/bw_counts_92.h5
/tf/corpus            frame_table  (typ->appendable_multi,nrows->734163,ncols->3,indexers->[index],dc->[token,language])
/tf/docs              frame_table  (typ->appendable_multi,nrows->4848439,ncols->3,indexers->[index],dc->[token,id])     
<class 'pa

## Todo

- Check for duplicates in "successful-counts.txt". I caught one text duplicated due to a bug, so it is good to check that it doesn't happen again.
- Is the uint32 size enough (0 - 4294967295) once I start merging? **NO** - `the` had 35 billion occurrences in the PD dataset.
- Create a table index after storage (e.g. `store.create_table_index('df', optlevel=9, kind='full')`)
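
The uint32 question can be checked directly. A small sketch, using the rounded 35 billion figure from the list above rather than an exact count:

```python
import numpy as np

uint32_max = np.iinfo(np.uint32).max
int64_max = np.iinfo(np.int64).max

the_count = 35_000_000_000  # rounded figure quoted above for `the`

fits_uint32 = the_count <= uint32_max
fits_int64 = the_count <= int64_max

print(uint32_max, fits_uint32, fits_int64)
```

So merged corpus counts need a 64-bit integer type, even though per-batch counts fit comfortably in 32 bits.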

## Notes
If an English word doesn't occur at least twice in a batch of 400 texts (i.e. 2 in 180m words), it is trimmed from that batch's corpus counts; other languages are left untrimmed. This cuts the size considerably.
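
A toy version of that rule, following the note as worded (keep a row if it is not English, or if it was seen at least `mincount` times). The tokens and counts are invented for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('eng', 'the'), ('eng', 'zyx'), ('lat', 'rarum')],
    names=['language', 'token'])
df = pd.DataFrame({'count': [120, 1, 1]}, index=index)

mincount = 2
is_english = df.index.get_level_values('language') == 'eng'

# Rare English tokens are dropped; rare tokens in other languages survive
kept = df[~is_english | (df['count'] >= mincount)]
print(kept)
```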

## Timing

- 50/chunk, 20 batches: 5.40s/it, 1m48s for 1000 texts
- 50/chunk, 20 batches, no doc saving: 4.68s/it, 1m33s for 1000 texts (doc saving seems fine to keep around?)
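
For a very rough sense of scale, the first timing can be extrapolated to the full listing. This assumes throughput stays linear and that the 50/chunk timing carries over to other chunk sizes, neither of which was measured here.

```python
n_texts = 4805430            # "Number of texts" printed earlier in this notebook
seconds_per_1000 = 60 + 48   # 1m48s per 1000 texts, with doc saving

total_seconds = n_texts / 1000 * seconds_per_1000
total_days = total_seconds / 86400

print("%.1f days" % total_days)
```

That works out to about six days of wall-clock time at this rate.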