# Clustering the twitter samples corpus

**corpushash** is a simple library that aims to make the natural language processing of sensitive documents easier. The library enables performing common NLP tasks on sensitive documents without disclosing their contents. This is done by hashing every token in the corpus along with a salt (to prevent dictionary attacks).
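As a rough illustration of salted token hashing (this is not `corpushash`'s actual implementation), the sketch below hashes a token together with a random salt using SHA-256 from the standard library. The function name `hash_token`, the 16-byte salt, and the hex encoding are all assumptions made for the example.

```python
import hashlib
import os


def hash_token(token, salt):
    """Return the hex SHA-256 digest of salt + token (illustrative only)."""
    return hashlib.sha256(salt + token.encode("utf-8")).hexdigest()


salt = os.urandom(16)        # random salt (hypothetical choice of length)
other_salt = os.urandom(16)

# the same token and salt always give the same hash ...
same = hash_token("gutenberg", salt) == hash_token("gutenberg", salt)
# ... while another salt gives an unrelated hash, so a precomputed
# table of unsalted hashes does not help an attacker
different = hash_token("gutenberg", salt) != hash_token("gutenberg", other_salt)
print(same, different)
```

Because the hash is deterministic for a fixed salt, equal tokens stay equal after hashing, which is what lets frequency-based NLP methods run on the hashed text.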

Its workflow is as simple as providing the sensitive corpus as a Python nested list (or generator) whose elements are themselves (nested) lists of strings. After the hashing is done, NLP can be carried out by a third party, and when the results are in, they can be decoded with a dictionary that maps hashes to the original strings. In short:

```python
import corpushash as ch
hashed_corpus = ch.CorpusHash(mycorpus_as_a_nested_list, '/home/sensitive-corpus')
# logs: "42 documents hashed and saved to '/home/sensitive-corpus/public/$(timestamp)'"
```
**NLP is done, and `results` are in**:
```python
for token in results:
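    # each decode_dictionary entry is indexed with [0] to get the original token,
    # as in the decoding cell further down in this notebook
    print(token, ">", hashed_corpus.decode_dictionary[token][0])
# prints, e.g.: 7)JBMGG?sGu+>%Js~dG=%c1Qn1HpAU{jM-~Buu7? > gutenberg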
```
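The round trip described above (hash a nested corpus, hand it over, decode the results) can be sketched in a few lines of plain Python. This is a simplified stand-in rather than the library's code: `hash_nested`, `decode_nested`, the fixed salt, and the flat hash-to-token dictionary are hypothetical.

```python
import hashlib


def hash_nested(item, salt, decode_dictionary):
    """Recursively hash every string in a nested list and record hash-to-token pairs."""
    if isinstance(item, str):
        hashed = hashlib.sha256(salt + item.encode("utf-8")).hexdigest()
        decode_dictionary[hashed] = item
        return hashed
    return [hash_nested(child, salt, decode_dictionary) for child in item]


def decode_nested(item, decode_dictionary):
    """Invert hash_nested using the decode dictionary."""
    if isinstance(item, str):
        return decode_dictionary[item]
    return [decode_nested(child, decode_dictionary) for child in item]


corpus = [["hopeless", "for", "tmr"], [["nested", "tokens"], "for"]]
decode_dictionary = {}
hashed_corpus = hash_nested(corpus, b"example-salt", decode_dictionary)

restored = decode_nested(hashed_corpus, decode_dictionary)
print(restored == corpus)
```

Only `hashed_corpus` would be shared with the third party; the decode dictionary stays with the data owner.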

#### loading libraries

In [86]:
import gensim
import logging, bz2, os
from corpushash import CorpusHash
from nltk.corpus import twitter_samples as tt
import numpy as np
import string
from gensim import corpora, models, similarities

In [87]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

#### uncomment this if you don't have the corpus downloaded

In [88]:
#import nltk
#nltk.download('twitter_samples')

specify the directory you'd like to save files to:

In [89]:
path = os.getcwd()

this is needed for reproducibility: `gensim` fits the LSI model with a randomized (stochastic) SVD, so results can vary between runs unless the random seed is fixed:

In [90]:
np.random.seed(42)

### the twitter samples corpus

this is what the corpus looks like:

In [91]:
tt.strings()[:10]

['hopeless for tmr :(',
 "Everything in the kids section of IKEA is so cute. Shame I'm nearly 19 in 2 months :(",
 '@Hegelbon That heart sliding into the waste basket. :(',
 '“@ketchBurning: I hate Japanese call him "bani" :( :(”\n\nMe too',
 'Dang starting next week I have "work" :(',
 "oh god, my babies' faces :( https://t.co/9fcwGvaki0",
 '@RileyMcDonough make me smile :((',
 '@f0ggstar @stuartthull work neighbour on motors. Asked why and he said hates the updates on search :( http://t.co/XvmTUikWln',
 'why?:("@tahuodyy: sialan:( https://t.co/Hv1i0xcrL2"',
 'Athabasca glacier was there in #1948 :-( #athabasca #glacier #jasper #jaspernationalpark #alberta #explorealberta #… http://t.co/dZZdqmf7Cz']

but we'll be using the pre-tokenized version:

In [92]:
tt.tokenized()[0]

['hopeless', 'for', 'tmr', ':(']

In [93]:
len(tt.tokenized())

30000

In [94]:
decoded_twitter = tt.tokenized()

#### building gensim dictionary

from this tokenized corpus gensim will build a dictionary that maps every token to an ID, a mapping which is later used to calculate the tf-idf weights:

In [95]:
id2word = gensim.corpora.Dictionary(decoded_twitter)

2017-05-23 19:20:43,589 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2017-05-23 19:20:43,775 : INFO : adding document #10000 to Dictionary(24343 unique tokens: ['hopeless', 'for', 'tmr', ':(', 'Everything']...)
2017-05-23 19:20:44,054 : INFO : adding document #20000 to Dictionary(35614 unique tokens: ['hopeless', 'for', 'tmr', ':(', 'Everything']...)
2017-05-23 19:20:44,343 : INFO : built Dictionary(42532 unique tokens: ['hopeless', 'for', 'tmr', ':(', 'Everything']...) from 30000 documents (total 580322 corpus positions)


In [96]:
id2word[0]

'hopeless'
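Conceptually, the dictionary is just a token-to-integer mapping plus its inverse. The pure-Python sketch below is a simplified stand-in (the helper `build_token2id` is hypothetical, and `gensim`'s own ID assignment order may differ between versions):

```python
def build_token2id(documents):
    """Assign consecutive integer IDs to tokens in order of first appearance."""
    token2id = {}
    for doc in documents:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


docs = [["hopeless", "for", "tmr", ":("], ["make", "me", "smile", ":("]]
token2id = build_token2id(docs)
id2token = {idx: tok for tok, idx in token2id.items()}
print(id2token[0])
```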

#### bag-of-words

to build a tf-idf model, the gensim library needs an input that yields a vectorized bag-of-words for each document when iterated over:

In [97]:
mm = [id2word.doc2bow(text) for text in decoded_twitter]
gensim.corpora.MmCorpus.serialize(os.path.join(path, 'twitter_pt_tfidf.mm'), mm)

2017-05-23 19:20:44,984 : INFO : storing corpus in Matrix Market format to /home/guest/Documents/git/corpushash/twitter_pt_tfidf.mm
2017-05-23 19:20:44,987 : INFO : saving sparse matrix to /home/guest/Documents/git/corpushash/twitter_pt_tfidf.mm
2017-05-23 19:20:44,987 : INFO : PROGRESS: saving document #0
2017-05-23 19:20:45,005 : INFO : PROGRESS: saving document #1000
2017-05-23 19:20:45,023 : INFO : PROGRESS: saving document #2000
2017-05-23 19:20:45,041 : INFO : PROGRESS: saving document #3000
2017-05-23 19:20:45,058 : INFO : PROGRESS: saving document #4000
2017-05-23 19:20:45,076 : INFO : PROGRESS: saving document #5000
2017-05-23 19:20:45,094 : INFO : PROGRESS: saving document #6000
2017-05-23 19:20:45,112 : INFO : PROGRESS: saving document #7000
2017-05-23 19:20:45,130 : INFO : PROGRESS: saving document #8000
2017-05-23 19:20:45,149 : INFO : PROGRESS: saving document #9000
2017-05-23 19:20:45,168 : INFO : PROGRESS: saving document #10000
2017-05-23 19:20:45,198 : INFO : PROGRESS
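For intuition, a bag-of-words vector is a sorted list of `(token_id, count)` pairs for one document. The sketch below mimics that shape in plain Python; `doc2bow_sketch` and the tiny `token2id` mapping are hypothetical, and unknown tokens are simply skipped here.

```python
from collections import Counter


def doc2bow_sketch(tokens, token2id):
    """Return sorted (token_id, count) pairs for the known tokens of one document."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


token2id = {"why": 0, ":(": 1, "sad": 2}
bow = doc2bow_sketch(["why", ":(", ":(", "unknown"], token2id)
print(bow)
```

Word order is discarded; only which tokens occur, and how often, survives into the model.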

In [98]:
%%time
if os.path.exists(os.path.join(path, 'twitter_tfidf_model')):
    tfidf = models.TfidfModel.load(os.path.join(path, 'twitter_tfidf_model'))
else:
    tfidf = models.TfidfModel(mm)
    tfidf.save(os.path.join(path, 'twitter_tfidf_model'))  # save where the check above looks

2017-05-23 19:20:45,769 : INFO : collecting document frequencies
2017-05-23 19:20:45,771 : INFO : PROGRESS: processing document #0
2017-05-23 19:20:45,800 : INFO : PROGRESS: processing document #10000
2017-05-23 19:20:45,851 : INFO : PROGRESS: processing document #20000
2017-05-23 19:20:45,898 : INFO : calculating IDF weights for 30000 documents and 42531 features (538552 matrix non-zeros)
2017-05-23 19:20:45,919 : INFO : saving TfidfModel object under twitter_tfidf_model, separately None
2017-05-23 19:20:45,926 : INFO : saved twitter_tfidf_model


CPU times: user 153 ms, sys: 2 ms, total: 155 ms
Wall time: 158 ms
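To make the weighting concrete, here is a small pure-Python version of tf-idf over bag-of-words vectors. It follows my understanding of `gensim`'s default scheme (raw term count times a base-2 log inverse document frequency, then scaling each document to unit length); treat that formula, and the helper name `tfidf_sketch`, as assumptions rather than a statement of the library's internals.

```python
import math


def tfidf_sketch(bow_corpus):
    """Weight each (id, count) by count * log2(N / df), then L2-normalise per document."""
    n_docs = len(bow_corpus)
    doc_freq = {}
    for bow in bow_corpus:
        for token_id, _ in bow:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted_corpus = []
    for bow in bow_corpus:
        weighted = [(tid, cnt * math.log2(n_docs / doc_freq[tid])) for tid, cnt in bow]
        norm = math.sqrt(sum(w * w for _, w in weighted))
        if norm > 0:
            # drop zero weights (tokens present in every document)
            weighted = [(tid, w / norm) for tid, w in weighted if w > 0]
        else:
            weighted = []
        weighted_corpus.append(weighted)
    return weighted_corpus


bows = [[(0, 1), (1, 2)], [(0, 1), (2, 1)], [(0, 3)]]
result = tfidf_sketch(bows)
for doc in result:
    print(doc)
```

Token `0` occurs in every document, so its weight is zero and it drops out; rarer tokens dominate each vector.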


## Calculating the LSI model

The next step is to train the LSI model on a tf-idf-transformed corpus, so we need yet another generator to yield the transformed documents.

In [99]:
def tfidf_corpus_stream(corpus):
    for doc in corpus:
        yield tfidf[doc]

In [100]:
tfidf_corpus_s = tfidf_corpus_stream(mm)
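One caveat with a plain generator is that it can be consumed only once, which is fine as long as the consumer reads the corpus in a single pass. The sketch below contrasts a one-shot generator with a small re-iterable wrapper; `stream`, `ReiterableStream`, and `double` (a stand-in for `tfidf[doc]`) are hypothetical names introduced for the example.

```python
def stream(corpus, transform):
    """Lazily yield transform(doc) for each document."""
    for doc in corpus:
        yield transform(doc)


class ReiterableStream:
    """Wrap a corpus so every iteration starts a fresh generator."""

    def __init__(self, corpus, transform):
        self.corpus = corpus
        self.transform = transform

    def __iter__(self):
        return stream(self.corpus, self.transform)


def double(bow):
    """Stand-in for tfidf[doc]: doubles every count."""
    return [(tid, 2 * cnt) for tid, cnt in bow]


docs = [[(0, 1)], [(1, 2)]]

one_shot = stream(docs, double)
first_pass = list(one_shot)
second_pass = list(one_shot)        # the generator is already exhausted

reusable = ReiterableStream(docs, double)
print(len(first_pass), len(second_pass), len(list(reusable)), len(list(reusable)))
```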

In [101]:
if os.path.exists(os.path.join(path, 'twitter_lsi_model')):
    lsi = gensim.models.LsiModel.load(os.path.join(path, 'twitter_lsi_model'))
else:
    lsi = gensim.models.lsimodel.LsiModel(corpus=tfidf_corpus_s, id2word=id2word, num_topics=100)
    lsi.save(os.path.join(path, 'twitter_lsi_model'))

2017-05-23 19:20:46,307 : INFO : using serial LSI version on this node
2017-05-23 19:20:46,308 : INFO : updating model with new documents
2017-05-23 19:20:46,722 : INFO : preparing a new chunk of documents
2017-05-23 19:20:46,818 : INFO : using 100 extra samples and 2 power iterations
2017-05-23 19:20:46,819 : INFO : 1st phase: constructing (42532, 200) action matrix
2017-05-23 19:20:47,034 : INFO : orthonormalizing (42532, 200) action matrix
2017-05-23 19:20:49,605 : INFO : 2nd phase: running dense svd on (200, 20000) matrix
2017-05-23 19:20:50,298 : INFO : computing the final decomposition
2017-05-23 19:20:50,299 : INFO : keeping 100 factors (discarding 22.970% of energy spectrum)
2017-05-23 19:20:50,384 : INFO : processed documents up to #20000
2017-05-23 19:20:50,387 : INFO : topic #0(20.207): 0.246*""" + 0.207*"SNP" + 0.186*"Tories" + 0.174*"Miliband" + 0.171*"in" + 0.162*"is" + 0.160*"to" + 0.154*"of" + 0.146*"Sco" + 0.146*"…"
2017-05-23 19:20:50,389 : INFO : topic #1(17.533): 0.
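For reference, LSI amounts to a truncated singular value decomposition of the tf-idf term-document matrix. The symbols below are notation introduced here, not taken from the notebook: $X$ is the matrix with $m$ tokens as rows and $n$ documents as columns, and $k$ is `num_topics` (100 in this run).

```latex
X \approx U_k \Sigma_k V_k^{\top},
\qquad X \in \mathbb{R}^{m \times n},\quad
U_k \in \mathbb{R}^{m \times k},\quad
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k),\quad
V_k \in \mathbb{R}^{n \times k}
```

Each column of $U_k$ is one topic, and its entries are the per-token coefficients printed by `show_topic`. As far as I can tell, the number in parentheses in log lines such as `topic #0(20.207)` is the corresponding singular value $\sigma_i$.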

In [102]:
for n in range(17):
    print("====================")
    print("Topic {}:".format(n))
    print("Coef.\t Token")
    print("--------------------")
    for tok,coef in lsi.show_topic(n):
        print("{:.3}\t{}".format(coef,tok))

Topic 0:
Coef.	 Token
--------------------
0.313	"
0.181	Tories
0.171	preoccupied
0.171	inequality
0.171	@Tommy_Colc
0.171	wrote
0.167	Miliband
0.167	claiming
0.166	w
0.164	man
Topic 1:
Coef.	 Token
--------------------
0.247	SNP
-0.21	"
0.178	Sco
0.177	to
0.176	protect
0.176	lots
0.175	definitely
0.172	@NicolaSturgeon
0.172	rather
0.17	let
Topic 2:
Coef.	 Token
--------------------
0.194	the
0.175	.
-0.169	protect
-0.169	lots
-0.168	definitely
-0.168	Sco
0.159	a
0.159	%
0.148	I
-0.147	MPs
Topic 3:
Coef.	 Token
--------------------
-0.646	%
-0.319	-
-0.266	(
-0.22	)
-0.195	1
-0.131	CON
-0.13	LAB
-0.127	poll
-0.126	8
-0.125	34
Topic 4:
Coef.	 Token
--------------------
-0.236	thus
-0.236	ahem
-0.236	@thomasmessenger
-0.236	http://t.co/DkLwCwzhDA
-0.235	financial
-0.234	caused
-0.233	global
-0.233	crisis
-0.225	For
-0.225	overspent
Topic 5:
Coef.	 Token
--------------------
0.341	FT
0.321	(
-0.232	%
0.2	)
0.19	:(
0.177	Jonathan
0.177	Ford
0.177	writer
0.176	Boris
-0.162	'
Topic 6:
Coef.	

# LSI on the hashed corpus

Next, we hash all of the original documents and run the same analysis we ran on the plain corpus.

In [103]:
np.random.seed(42)

## processing using the `corpushash` library

#### instantiating the CorpusHash class, which hashes the provided corpus and saves it under corpus_path:

In [104]:
%%time
hashed = CorpusHash(decoded_twitter, 'twitter')

2017-05-23 19:20:55,158 - corpushash.hashers - INFO - dictionaries from previous hashing found. loading them.
2017-05-23 19:20:55,158 : INFO : dictionaries from previous hashing found. loading them.
2017-05-23 19:20:59,462 - corpushash.hashers - INFO - 30000 documents hashed and saved to twitter/public/2017-05-23_19-20-55-158049.
2017-05-23 19:20:59,462 : INFO : 30000 documents hashed and saved to twitter/public/2017-05-23_19-20-55-158049.


CPU times: user 2.56 s, sys: 813 ms, total: 3.37 s
Wall time: 4.32 s


that is it. `corpushash`'s work is done.

#### building dictionary for the hashed corpus

In [105]:
id2word = gensim.corpora.Dictionary(hashed.read_hashed_corpus())

2017-05-23 19:20:59,480 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2017-05-23 19:21:00,100 : INFO : adding document #10000 to Dictionary(24343 unique tokens: ['U+)Z~DkL&Y*tv9|JE+WuIku3sqrEkbkcx@_!*{DS', 'OKi%KcU3pFH?GQV=dMytpb=FXANJR4+aX|-;T%#G', 'F#lHfIZmtjgo_S)N$r411J8#X%Veb(%r8qSY__7(', 'j<oOL8!FI2kXUr!C&u}#q#ipUM4pq%dbICGr(A8%', '8=wMA2$wf<p}Y|by;)*c7<9lUz3N?LKiGV4Y7Tld']...)
2017-05-23 19:21:00,835 : INFO : adding document #20000 to Dictionary(35614 unique tokens: ['U+)Z~DkL&Y*tv9|JE+WuIku3sqrEkbkcx@_!*{DS', 'OKi%KcU3pFH?GQV=dMytpb=FXANJR4+aX|-;T%#G', 'F#lHfIZmtjgo_S)N$r411J8#X%Veb(%r8qSY__7(', 'j<oOL8!FI2kXUr!C&u}#q#ipUM4pq%dbICGr(A8%', '8=wMA2$wf<p}Y|by;)*c7<9lUz3N?LKiGV4Y7Tld']...)
2017-05-23 19:21:01,562 : INFO : built Dictionary(42532 unique tokens: ['U+)Z~DkL&Y*tv9|JE+WuIku3sqrEkbkcx@_!*{DS', 'OKi%KcU3pFH?GQV=dMytpb=FXANJR4+aX|-;T%#G', 'F#lHfIZmtjgo_S)N$r411J8#X%Veb(%r8qSY__7(', 'j<oOL8!FI2kXUr!C&u}#q#ipUM4pq%dbICGr(A8%', '8=wMA2$wf<p}Y|by;)*c7<9lUz3N?LKi

In [106]:
id2word[0]

'U+)Z~DkL&Y*tv9|JE+WuIku3sqrEkbkcx@_!*{DS'

In [107]:
mm = [id2word.doc2bow(text) for text in hashed.read_hashed_corpus()]
gensim.corpora.MmCorpus.serialize(os.path.join(path, 'hashed_twitter_pt_tfidf.mm'), mm)

2017-05-23 19:22:06,911 : INFO : storing corpus in Matrix Market format to /home/guest/Documents/git/corpushash/hashed_twitter_pt_tfidf.mm
2017-05-23 19:22:07,342 : INFO : saving sparse matrix to /home/guest/Documents/git/corpushash/hashed_twitter_pt_tfidf.mm
2017-05-23 19:22:07,344 : INFO : PROGRESS: saving document #0
2017-05-23 19:22:07,362 : INFO : PROGRESS: saving document #1000
2017-05-23 19:22:07,381 : INFO : PROGRESS: saving document #2000
2017-05-23 19:22:07,398 : INFO : PROGRESS: saving document #3000
2017-05-23 19:22:07,416 : INFO : PROGRESS: saving document #4000
2017-05-23 19:22:07,433 : INFO : PROGRESS: saving document #5000
2017-05-23 19:22:07,451 : INFO : PROGRESS: saving document #6000
2017-05-23 19:22:07,470 : INFO : PROGRESS: saving document #7000
2017-05-23 19:22:07,489 : INFO : PROGRESS: saving document #8000
2017-05-23 19:22:07,507 : INFO : PROGRESS: saving document #9000
2017-05-23 19:22:07,526 : INFO : PROGRESS: saving document #10000
2017-05-23 19:22:07,557 : I

In [108]:
%%time
if os.path.exists(os.path.join(path, 'hashed_twitter_tfidf_model')):
    tfidf = models.TfidfModel.load(os.path.join(path, 'hashed_twitter_tfidf_model'))
else:
    tfidf = models.TfidfModel(mm)
    tfidf.save(os.path.join(path, 'hashed_twitter_tfidf_model'))

2017-05-23 19:22:08,157 : INFO : collecting document frequencies
2017-05-23 19:22:08,159 : INFO : PROGRESS: processing document #0
2017-05-23 19:22:08,189 : INFO : PROGRESS: processing document #10000
2017-05-23 19:22:08,237 : INFO : PROGRESS: processing document #20000
2017-05-23 19:22:08,284 : INFO : calculating IDF weights for 30000 documents and 42531 features (538552 matrix non-zeros)
2017-05-23 19:22:08,304 : INFO : saving TfidfModel object under /home/guest/Documents/git/corpushash/hashed_twitter_tfidf_model, separately None
2017-05-23 19:22:08,310 : INFO : saved /home/guest/Documents/git/corpushash/hashed_twitter_tfidf_model


CPU times: user 149 ms, sys: 2 ms, total: 151 ms
Wall time: 154 ms


The next step is to train the LSI model on a tf-idf-transformed corpus, so we need yet another generator to yield the transformed documents.

In [109]:
def tfidf_corpus_stream(corpus):
    for doc in corpus:
        yield tfidf[doc]

In [110]:
tfidf_corpus_s = tfidf_corpus_stream(mm)

## Calculating the LSI model

In [111]:
if os.path.exists(os.path.join(path, 'hashed_twitter_lsi_model')):
    lsih = gensim.models.LsiModel.load(os.path.join(path, 'hashed_twitter_lsi_model'))
else:
    lsih = gensim.models.lsimodel.LsiModel(corpus=tfidf_corpus_s, id2word=id2word, num_topics=100)
    lsih.save(os.path.join(path, 'hashed_twitter_lsi_model'))

2017-05-23 19:22:16,876 : INFO : using serial LSI version on this node
2017-05-23 19:22:16,877 : INFO : updating model with new documents
2017-05-23 19:22:17,307 : INFO : preparing a new chunk of documents
2017-05-23 19:22:17,402 : INFO : using 100 extra samples and 2 power iterations
2017-05-23 19:22:17,403 : INFO : 1st phase: constructing (42532, 200) action matrix
2017-05-23 19:22:17,625 : INFO : orthonormalizing (42532, 200) action matrix
2017-05-23 19:22:20,461 : INFO : 2nd phase: running dense svd on (200, 20000) matrix
2017-05-23 19:22:21,477 : INFO : computing the final decomposition
2017-05-23 19:22:21,477 : INFO : keeping 100 factors (discarding 22.970% of energy spectrum)
2017-05-23 19:22:21,616 : INFO : processed documents up to #20000
2017-05-23 19:22:21,618 : INFO : topic #0(20.207): 0.246*"jop5|1Q2hSwG`#tZK_2ioV=yOV*@^!wp5COZiM^a" + 0.207*"VSJ@Y<?2*Ml5K`*M^`fn%Sl87VeEJn97HZ@J!N)q" + 0.186*"!>9)Au<YaJX+{JM7AP?)U5*oajx`V))b?ULH?0Hh" + 0.174*"_m}~cvg1HpVIO|bt>dW8=MGs!C8L2kk

Let us now look at the topics generated, decoding the hashed tokens using the `decode_dictionary`.

In [112]:
for n in range(17):
    print("====================")
    print("Topic {}:".format(n))
    print("Coef.\t Token")
    print("--------------------")
    for tok,coef in lsih.show_topic(n):
        tok = hashed.decode_dictionary[tok.strip()][0]
        print("{:.3}\t{}".format(coef,tok))

Topic 0:
Coef.	 Token
--------------------
0.313	"
0.181	Tories
0.171	preoccupied
0.171	inequality
0.171	@Tommy_Colc
0.171	wrote
0.167	Miliband
0.167	claiming
0.166	w
0.164	man
Topic 1:
Coef.	 Token
--------------------
0.247	SNP
-0.21	"
0.178	Sco
0.177	to
0.176	protect
0.176	lots
0.175	definitely
0.172	@NicolaSturgeon
0.172	rather
0.17	let
Topic 2:
Coef.	 Token
--------------------
0.194	the
0.175	.
-0.169	protect
-0.169	lots
-0.168	definitely
-0.168	Sco
0.159	a
0.159	%
0.148	I
-0.147	MPs
Topic 3:
Coef.	 Token
--------------------
-0.646	%
-0.319	-
-0.266	(
-0.22	)
-0.195	1
-0.131	CON
-0.13	LAB
-0.127	poll
-0.126	8
-0.125	34
Topic 4:
Coef.	 Token
--------------------
-0.236	thus
-0.236	ahem
-0.236	@thomasmessenger
-0.236	http://t.co/DkLwCwzhDA
-0.235	financial
-0.234	caused
-0.233	global
-0.233	crisis
-0.225	For
-0.225	overspent
Topic 5:
Coef.	 Token
--------------------
0.341	FT
0.321	(
-0.232	%
0.2	)
0.19	:(
0.177	Jonathan
0.177	Ford
0.177	writer
0.176	Boris
-0.162	'
Topic 6:
Coef.	
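The decoding step in the loop above can be factored into a small helper. The sketch assumes, as the `[0]` index in that cell suggests, that each `decode_dictionary` value is a sequence whose first element is the original token; the sample hashes and the second tuple element are made up for illustration.

```python
def decode_topic(topic, decode_dictionary):
    """Replace each hashed token in [(hash, coef), ...] by its plain token."""
    return [(decode_dictionary[h.strip()][0], coef) for h, coef in topic]


# hypothetical entries: hash mapped to (original token, extra metadata)
decode_dictionary = {
    "hashA": ("SNP", "salt-1"),
    "hashB": ("Tories", "salt-2"),
}
hashed_topic = [("hashA", 0.247), ("hashB ", -0.21)]
decoded = decode_topic(hashed_topic, decode_dictionary)
print(decoded)
```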

#### comparing the resulting topics, we see that the NLP results are the same regardless of which corpus we use, i.e., hashed corpora can be used to perform NLP tasks in a lossless manner.
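The reason this works can be checked on a toy example: as long as distinct tokens map to distinct hashes, hashing only renames the vocabulary, so every count that bag-of-words methods rely on is unchanged. The helpers `relabel` and `count_profile` and the fixed salt below are hypothetical, and SHA-256 stands in for whatever hash the library uses.

```python
import hashlib
from collections import Counter


def relabel(corpus, mapping):
    """Replace every token by its image under mapping."""
    return [[mapping[tok] for tok in doc] for doc in corpus]


def count_profile(corpus):
    """Per-document sorted token counts, ignoring what the tokens are called."""
    return [sorted(Counter(doc).values()) for doc in corpus]


plain = [["why", ":(", ":("], ["why", "smile"]]
vocab = {tok for doc in plain for tok in doc}
to_hash = {tok: hashlib.sha256(b"salt" + tok.encode("utf-8")).hexdigest() for tok in vocab}
hashed = relabel(plain, to_hash)

# distinct tokens stay distinct, so the counts are unchanged
injective = len(set(to_hash.values())) == len(vocab)
same_counts = count_profile(plain) == count_profile(hashed)
print(injective, same_counts)
```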