
## `corpushash` tutorial

The `corpushash` library lets you perform common NLP tasks on sensitive documents without disclosing their contents. It does so by hashing every token in the corpus together with a salt (to prevent dictionary attacks).

The library requires as input:

* a tokenized corpus as a nested list, whose elements are themselves nested lists holding the tokens of each document in the corpus

    Each level of nesting corresponds to a level of document structure: chapters, paragraphs, sentences. You decide how the nested list is built and structured, as long as its bottom-most elements are strings.

* `corpus_path`, a path to a directory where the output files are to be stored
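
As an illustration of the expected shape, here is a toy corpus with two documents, each split into paragraphs and sentences. The helper `is_nested_list_of_strings` is a hypothetical name, not part of `corpushash`; it only checks the "strings as bottom-most elements" requirement described above.

```python
# Toy corpus: corpus -> documents -> paragraphs -> sentences -> tokens.
corpus = [
    [  # document 0
        [["Emma", "Woodhouse"], ["handsome", "clever"]],  # paragraph with two sentences
        [["She", "was", "the", "youngest"]],              # paragraph with one sentence
    ],
    [  # document 1
        [["Call", "me", "Ishmael"]],
    ],
]


def is_nested_list_of_strings(item):
    """Return True if `item` is a list containing only strings or further such lists."""
    if not isinstance(item, list):
        return False
    return all(isinstance(child, str) or is_nested_list_of_strings(child)
               for child in item)


corpus_is_valid = is_nested_list_of_strings(corpus)
```

Running a check like this before hashing can catch stray non-string values (numbers, `None`) left behind by a tokenizer.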

The output includes:

* a .json file for every document in the corpus provided, named sequentially with non-negative integers (the first document is `0.json`), stored in `corpus_path/public/$(timestamp-of-hash)/`

* two pickled dictionaries stored in `corpus_path/private`. They are used to decode the .json files or the NLP results

## Preparing the input data

loading libraries...

In [2]:
import os
import json
from nltk.corpus import gutenberg
import corpushash as ch
import base64
import hashlib
import random

we'll use the gutenberg corpus as test data, which is available through the nltk library.

downloading test data (if needed):

In [3]:
import nltk
#nltk.download('gutenberg')  # uncomment this line if you don't have the data yet

files in test data:

In [4]:
gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

creating test corpus path, where hashed documents will be stored as .json files:

In [5]:
base_path = os.getcwd()
base_path

'/home/guest/Documents/git/hashed-nlp'

In [6]:
corpus_path = os.path.join(base_path, 'guten_test')
corpus_path

'/home/guest/Documents/git/hashed-nlp/guten_test'

#### function to split text into nested list:

In [7]:
excerpt = gutenberg.raw('austen-emma.txt')[50:478]
print(excerpt)

Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period. 


every paragraph and sentence is its own list:

In [8]:
print(ch.text_split(excerpt))

[[['Emma', 'Woodhouse', 'handsome', 'clever', 'and', 'rich', 'with', 'a', 'comfortable', 'home']], [['and', 'happy', 'disposition', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best', 'blessings']], [['of', 'existence', 'and', 'had', 'lived', 'nearly', 'twenty-one', 'years', 'in', 'the', 'world']], [['with', 'very', 'little', 'to', 'distress', 'or', 'vex', 'her']], [['She', 'was', 'the', 'youngest', 'of', 'the', 'two', 'daughters', 'of', 'a', 'most', 'affectionate']], [['indulgent', 'father', 'and', 'had', 'in', 'consequence', 'of', 'her', "sister's", 'marriage']], [['been', 'mistress', 'of', 'his', 'house', 'from', 'a', 'very', 'early', 'period']]]


## Input

the library takes as input a nested list whose elements are the original documents as nested lists. this can be an in-memory nested list or a generator that yields one nested list per document when it is iterated over.
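
for corpora that do not fit in memory, the generator option could look like the sketch below. `split_document` is a hypothetical stand-in tokenizer written only for this example (in this tutorial that role is played by `ch.text_split`), and `iter_corpus` is likewise an illustrative name.

```python
import re


def split_document(text):
    """Hypothetical tokenizer: one list of word tokens per paragraph."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return [re.findall(r"[\w'-]+", paragraph) for paragraph in paragraphs]


def iter_corpus(texts):
    """Yield one nested list per document, so only one document is tokenized at a time."""
    for text in texts:
        yield split_document(text)


raw_texts = [
    "Emma Woodhouse, handsome.\n\nShe was the youngest.",
    "Call me Ishmael.",
]
documents = list(iter_corpus(raw_texts))  # materialized here only to inspect the result
```

keep in mind that a generator can be consumed only once, and that it should yield the documents in a stable order.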

#### creating nested list made from the raw texts in the gutenberg corpus:

In [9]:
%%time
guten_list = []
for document_name in gutenberg.fileids():
    document = gutenberg.raw(document_name)
    split_document = ch.text_split(document)
    guten_list.append(split_document)

CPU times: user 1.1 s, sys: 22 ms, total: 1.13 s
Wall time: 2.16 s


excerpt:

In [12]:
document = random.choice(guten_list)
print(document[:10])

[[['Persuasion', 'by', 'Jane', 'Austen', '1818']], [['Chapter', '1']], [['Sir', 'Walter', 'Elliot', 'of', 'Kellynch', 'Hall', 'in', 'Somersetshire', 'was', 'a', 'man', 'who']], [['for', 'his', 'own', 'amusement', 'never', 'took', 'up', 'any', 'book', 'but', 'the', 'Baronetage']], [['there', 'he', 'found', 'occupation', 'for', 'an', 'idle', 'hour', 'and', 'consolation', 'in', 'a']], [['distressed', 'one', 'there', 'his', 'faculties', 'were', 'roused', 'into', 'admiration', 'and']], [['respect', 'by', 'contemplating', 'the', 'limited', 'remnant', 'of', 'the', 'earliest', 'patents']], [['there', 'any', 'unwelcome', 'sensations', 'arising', 'from', 'domestic', 'affairs']], [['changed', 'naturally', 'into', 'pity', 'and', 'contempt', 'as', 'he', 'turned', 'over']], [['the', 'almost', 'endless', 'creations', 'of', 'the', 'last', 'century', 'and', 'there']]]


## processing using `corpushash`

#### instantiating `CorpusHash` class, which hashes the provided corpus to the `corpus_path`:

In [14]:
%time hashed_guten = ch.CorpusHash(guten_list, corpus_path)

INFO:root:dictionaries from previous hashing found. loading them.
INFO:root:18 documents hashed and saved to /home/guest/Documents/git/hashed-nlp/guten_test/public/2017-05-03_18-48-49.


CPU times: user 6.28 s, sys: 114 ms, total: 6.39 s
Wall time: 6.41 s


## Output

### Encode dictionary

The encode dictionary maps each token to its hash. Because a token is looked up here before it is hashed, the same string is guaranteed to map to the same hash, even though its salt is random.

In [19]:
entries = random.sample(list(hashed_guten.encode_dictionary.keys()), k=5)
for entry in entries:
    print("token >> {:^20} | hashed_token >> '{}'".format(entry, hashed_guten.encode_dictionary[entry]))

token >>      endowment       | hashed_token >> 'H>tM?J1bYkuFd9#?v5@tO+74feW<f8PAkJRk+Be{'
token >>       Blessed        | hashed_token >> '$0Qjsn8sFljaR<2GSbk39MFjb1j-ccN@H$v!s0M-'
token >>       which--        | hashed_token >> 'sAPlSL3U0sI<87=772r{Fktzw#Mc9z^PfbKMC{QN'
token >>        paste         | hashed_token >> 'PKQsysD4vL>lH>4E6K^{rdT}`?J+OhTUTsHnhhyH'
token >>      thrashers       | hashed_token >> 'Mmnfz17U-dzhax0W&~P}T<gi&(`FZjdvULUo6Wr?'


### Decode dictionary

The decode dictionary is used to decode hashes to their original strings, so that one can make sense of the results of any subsequent NLP analysis. It also lists the salt that was used to obtain the given hash. This dictionary must be kept secret by the owner of the data.

In [16]:
entries = random.sample(list(hashed_guten.decode_dictionary.keys()), k=5)
for entry in entries:
    print("hashed_token >> '{}' | (token >> '{}', salt >> '{}')".format(entry, hashed_guten.decode_dictionary[entry][0], hashed_guten.decode_dictionary[entry][1][:4]))  # showing only the first 4 bytes of the salt for aesthetic reasons

hashed_token >> 'w8t4S<2`PpQPasQOaop9LiW2(I$74nx8#~dSBeW=' | (token >> 'rout', salt >> 'b'\xe9\x86F\xb3'')
hashed_token >> 'o8J*^aM0OG5Zd;P-xK!K8Vkq0$;?!-3`SI^7M$9^' | (token >> 'Darknesse', salt >> 'b'O\x1c:\x0c'')
hashed_token >> 'm8TcJZPk;F%`s7q<aZ%9SSKB1s0B^ZL7<NV@vI?&' | (token >> 'ayres', salt >> 'b'8\x14\xa8R'')
hashed_token >> 'a8?UumsR@P7T=0gZ1{l=_H?&$BWX^U6x|>9PhSh+' | (token >> 'foremen', salt >> 'b'\xc8\x82\xb1\x14'')
hashed_token >> 'o1?QpYJrk?59R~l4v55I?ER0RkOIQBxMiearEH~$' | (token >> 'God-enfranchis'd', salt >> 'b'9\xcaj\x15'')


### hashed .json files

the `walk_nested_list` function yields items in a nested list in order, regardless of their depth:

In [20]:
print(excerpt)

Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period. 


In [21]:
for element in ch.walk_nested_list(ch.text_split(excerpt)):
    print(element)

Emma
Woodhouse
handsome
clever
and
rich
with
a
comfortable
home
and
happy
disposition
seemed
to
unite
some
of
the
best
blessings
of
existence
and
had
lived
nearly
twenty-one
years
in
the
world
with
very
little
to
distress
or
vex
her
She
was
the
youngest
of
the
two
daughters
of
a
most
affectionate
indulgent
father
and
had
in
consequence
of
her
sister's
marriage
been
mistress
of
his
house
from
a
very
early
period
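
a function with this behaviour can be written as a short recursive generator. the sketch below is not the library's implementation; `walk` is a hypothetical name used only to illustrate the depth-first, in-order traversal shown above.

```python
def walk(nested):
    """Yield the bottom-most items of a nested list in order, whatever their depth."""
    for item in nested:
        if isinstance(item, list):
            yield from walk(item)  # descend into sub-lists
        else:
            yield item             # a token: hand it back as-is


flat = list(walk([[["Emma", "Woodhouse"]], [["and", "happy"]], ["home"]]))
```

because the traversal order depends only on the list structure, walking the original document and its hashed counterpart side by side pairs every token with its hash.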


#### we can use `walk_nested_list` to see what `corpushash` has done to the corpus.

adjusting parameters:

In [28]:
limit = 10  # number of entries to show
document = random.randrange(len(gutenberg.fileids()))  # randrange excludes the upper bound, so the index is always valid
print('document {} corresponds to {}.'.format(document, gutenberg.fileids()[document]))

document 13 corresponds to milton-paradise.txt.


**note**: take care to have your corpus yield the documents always in the same order, else you'll have a harder time knowing which hashed documents correspond to which documents.

reading output .json as a nested list:

In [30]:
document_path = os.path.join(hashed_guten.public_path, "{}.json".format(document))
with open(document_path, mode="rt") as fp:
    encoded_document = json.load(fp)

In [31]:
print("original token >> encoded token")
for ix, (original, encoded) in enumerate(zip(ch.walk_nested_list(guten_list[document]), ch.walk_nested_list(encoded_document))):
    if ix >= limit:  # stop once `limit` entries have been printed
        break
    print("'{}' >> '{}'".format(original, encoded))

original token >> encoded token
'Paradise' >> 'E{9&7z%En+)8vO?$5C$);e3%QIn^7#2grDHGp8m{'
'Lost' >> 'l_Z6-kd~m;D&`WBVNoXiaC)f%nhzfn!ym>O*NdQR'
'by' >> 'RF#n3YIWY99SilI=6|%A7{kxOG!PAmR6b_^l|pmM'
'John' >> 'J<uL5e0VMjgK2XOg6nzisq|A>Vs4^Eivwi}E`v@6'
'Milton' >> 'U13f5tX+fe2n6b~U4AbMLyIDH1!hDa<M-xICww!>'
'1667' >> 'j7bZ4ME;O5o-bgKEhX4g@RQcE`bYvl6zGh#kP)a)'
'Book' >> 'a!_YTTzXY!CHK4!q>x!6Fz+HK&ZE|M!Zv(1yc1{S'
'I' >> '>-^j%YNz_lr{|A_&MnGv95)G)8TETmZhND{*ET#n'
'Of' >> 'wsBz1B>a^hUeX>GZvJJV;5F;)5$!DOM}Yc;bm<yD'
'Man's' >> 'EINDPhg<|Ehf%*WeiOR%B2eI+jhu|1?*y~w5|NN&'


alternatively, one can check the `corpus_path` directory and read the output files using one's favorite text editor.

the path is:

In [32]:
hashed_guten.public_path

'/home/guest/Documents/git/hashed-nlp/guten_test/public/2017-05-03_18-48-49'

## Advanced

### how this library works

this is the basic algorithm of the `corpushash` library. for more details, check the source code, which is readable.

- Create an empty corpus structure (nested list) to hold the hashed tokens;
- Create a decoding dictionary: a mapping whose keys are encoded tokens (hashes) and whose values are the unhashed token and its salt.
- Create an encoding dictionary: a mapping whose keys are plain tokens and whose values are their cryptographic hashes.
- Iterate over the unhashed tokens:
    - Check if the token is in the encoding dictionary;
    - If so, add its hash value to the hashed tokens list;
    - If not, hash it together with a random salt, add the token, its hash and its salt to the encoding and decoding dictionaries, and add the hash to the hashed tokens list.
- Return the hashed corpus and the dictionaries
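
the steps above can be sketched in a few lines of standard-library Python. this is an illustration under stated assumptions, not the library's code: the names `hash_token` and `hash_corpus` are hypothetical, prepending the salt to the token is an assumption, and so is the base85 text encoding (chosen because it turns a 32-byte SHA-256 digest into 40 characters, like the hashes shown earlier).

```python
import base64
import hashlib
import os


def hash_token(token, salt):
    """Hash salt + token with SHA-256 and return the digest as base85 text (assumed encoding)."""
    digest = hashlib.sha256(salt + token.encode("utf-8")).digest()
    return base64.b85encode(digest).decode("ascii")


def hash_corpus(nested, encode_dictionary, decode_dictionary, salt_length=32):
    """Return a hashed copy of `nested`, filling both dictionaries along the way."""
    hashed = []
    for item in nested:
        if isinstance(item, list):
            # keep the original structure by recursing into sub-lists
            hashed.append(hash_corpus(item, encode_dictionary,
                                      decode_dictionary, salt_length))
            continue
        if item not in encode_dictionary:
            salt = os.urandom(salt_length)  # fresh random salt for a new token
            hashed_token = hash_token(item, salt)
            encode_dictionary[item] = hashed_token
            decode_dictionary[hashed_token] = (item, salt)
        hashed.append(encode_dictionary[item])
    return hashed


encode_dictionary, decode_dictionary = {}, {}
corpus = [[["the", "cat"], ["the", "dog"]]]
hashed = hash_corpus(corpus, encode_dictionary, decode_dictionary)
```

both occurrences of `the` receive the same hash because the second one is found in the encoding dictionary and never re-hashed.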

### optional arguments

- `hash_function`: default is SHA-256, but it can be any hash function offered by the `hashlib` library that does not need additional parameters (as scrypt does, for instance).
- `salt_length`: determines the salt length in bytes. default is 32 bytes.
- `one_salt`: determines whether all tokens are hashed with the same salt or each token gets its own. if `True`, `os.urandom` generates a single salt that is used in all hashings (**note**: in this case, choose a greater salt length). if `False`, `os.urandom` generates a new salt for each token.
- `encoding`: determines the encoding of the output .json files. default is utf-8, and you probably want to keep it that way.
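
to see which `hashlib` algorithms produce a digest without extra parameters, one can probe them as in the sketch below. it uses only the standard library and says nothing about how `corpushash` itself validates `hash_function`; the helper name `parameterless_hashes` is hypothetical.

```python
import hashlib


def parameterless_hashes():
    """List the guaranteed hashlib algorithms whose digest() needs no extra argument."""
    usable = []
    for name in sorted(hashlib.algorithms_guaranteed):
        try:
            hashlib.new(name, b"token").digest()
        except (TypeError, ValueError):
            # e.g. the SHAKE functions require an explicit digest length
            continue
        usable.append(name)
    return usable


usable = parameterless_hashes()
```

variable-length functions such as `shake_128` are filtered out, while fixed-length ones such as `sha256` remain.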

### hashing documents from the same corpus at different times

hashing documents from the same corpus at different times is supported in case you produce documents continuously.

if you pass a `corpus_path` used by a previous instance of `CorpusHash` to a new instance, the library automatically searches for and uses the dictionaries from the last hashing, which means the same tokens in the old and new documents will map to the same hash.

the files will be saved to a new directory in the `public/` directory, named after the timestamp of this instance of `CorpusHash`.
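
the effect of reusing the dictionaries can be illustrated with a small standalone sketch. everything here is an assumption made for the example: the file name `encode_dictionary.p`, the helper `hash_tokens`, and the salt-then-token SHA-256 hashing with base85 output are not taken from the library.

```python
import base64
import hashlib
import os
import pickle
import tempfile


def hash_tokens(tokens, encode_dictionary, salt_length=32):
    """Hash a flat list of tokens, reusing hashes already stored in the dictionary."""
    hashed = []
    for token in tokens:
        if token not in encode_dictionary:
            salt = os.urandom(salt_length)
            digest = hashlib.sha256(salt + token.encode("utf-8")).digest()
            encode_dictionary[token] = base64.b85encode(digest).decode("ascii")
        hashed.append(encode_dictionary[token])
    return hashed


with tempfile.TemporaryDirectory() as private_dir:
    dictionary_file = os.path.join(private_dir, "encode_dictionary.p")  # hypothetical name

    # first run: hash a batch and persist the dictionary
    encode_dictionary = {}
    first_batch = hash_tokens(["Emma", "Woodhouse"], encode_dictionary)
    with open(dictionary_file, "wb") as fp:
        pickle.dump(encode_dictionary, fp)

    # later run: load the dictionary before hashing new documents
    with open(dictionary_file, "rb") as fp:
        reloaded = pickle.load(fp)
    second_batch = hash_tokens(["Emma", "Knightley"], reloaded)

same_hash = first_batch[0] == second_batch[0]
```

`Emma` keeps its hash across runs, while `Knightley` is new and gets a freshly salted hash that is added to the reloaded dictionary.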

**note**: be careful when specifying a previously used `corpus_path` if the above is not what you want to do. if you want new dictionaries, just specify a new `corpus_path`.

**note**: when specifying a previously used `corpus_path`, take care with the optional arguments of `CorpusHash`. 

- specifying a different `hash_function` will cause the hashes of the same words in each instance to differ. 

- if you set `one_salt` to `True`, the library will assume this was also `True` for the previous instances of `CorpusHash`, and will take an arbitrary salt from the `decode_dictionary` as the salt to be used in this instance -- they should all be the same, after all. if `one_salt` was not `True` for the previous instances, this will produce unexpected results.
    - additionally, if you pass `one_salt=True` in this situation with any value to `salt_length`, this value will be ignored.
    - if you pass `one_salt=True` in this situation with a different value than the last to `hash_function`, any token not hashed in the previous instances will be hashed with a different hash function, which may mean a different hash length, for example. the tokens that were hashed in the previous instances will take the same value as before, because they are looked up in the `encode_dictionary` and not re-hashed.
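
before reusing a `corpus_path` with `one_salt=True`, it may be worth checking that the stored salts really are identical. the sketch below assumes only what the decode dictionary output above shows, namely that each value holds the token at index 0 and the salt at index 1; `shared_salt` is a hypothetical helper, not a library function.

```python
def shared_salt(decode_dictionary):
    """Return the single salt used in the dictionary, or raise if there is not exactly one."""
    salts = {value[1] for value in decode_dictionary.values()}
    if len(salts) != 1:
        raise ValueError("expected exactly one salt, found {}".format(len(salts)))
    return next(iter(salts))


# toy dictionaries standing in for hashed_guten.decode_dictionary
one_salt_dictionary = {"hash-a": ("cat", b"\x01\x02"), "hash-b": ("dog", b"\x01\x02")}
per_token_dictionary = {"hash-a": ("cat", b"\x01\x02"), "hash-b": ("dog", b"\x03\x04")}

salt = shared_salt(one_salt_dictionary)

try:
    shared_salt(per_token_dictionary)
    mixed_salts_detected = False
except ValueError:
    mixed_salts_detected = True
```

a `ValueError` here signals that the earlier hashing used per-token salts, which is the situation the note above warns about.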