# Clustering pages from Wikipedia

```
                                    Model: OptiPlex 3020 01 
    .smmy+-`             ./sdy:     OS: Porteus 3.2 x86_64 
  `omdo.    `.-/+osssso+/-` `+dy.   Kernel: 4.9.0-porteus 
 `yms.   `:shmNmdhsoo++osyyo-``oh.  Uptime: up 9 hours, 47 minutes 
 hm/   .odNmds/.`    ``.....:::-+s  Packages: 565 
/m:  `+dNmy:`   `./oyhhhhyyooo++so  Shell: bash 4.3.46 
ys  `yNmy-    .+hmmho:-.`     ```   Resolution: 1280x1024 
s:  yNm+`   .smNd+.                 DE: XFCE 
`` /Nm:    +dNd+`                   WM: Xfwm4 
   yN+   `smNy.                     WM Theme: Vertex 
   dm    oNNy`                      Theme: Vertex [GTK2], Adwaita [GTK3] 
   hy   -mNm.                       Icons: Paper [GTK2], Adwaita [GTK3] 
   +y   oNNo                        Terminal: Xfce4-terminal 
   `y`  sNN:                        Terminal Font: DejaVu Sans Mono 12 
    `:  +NN:                        CPU: Intel Core i5-4590 (4) @ 3.7GHz 
     `  .mNo                        GPU: Intel Integrated Graphics 
         /mm`                       Memory: 1824MB / 7909MB 
```

In [1]:
import gensim
import logging, bz2, os
from corpushash import CorpusHash
import numpy as np

Slow version of gensim.models.doc2vec is being used


In [2]:
import json
from gensim import corpora, models, similarities

In [3]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Loading the corpus

First you have to download and preprocess the Wikipedia dump file, following [Gensim's tutorial](https://radimrehurek.com/gensim/wiki.html).

In this notebook we analysed the Portuguese-language Wikipedia, using this dump file: [ptwiki-\*\*\*\*\*\*\*\*-pages-articles.xml.bz2](https://dumps.wikimedia.org/ptwiki/20170501/). We then ran this command:

```bash
$ python -m gensim.scripts.make_wiki ptwiki-********-pages-articles.xml.bz2

```

You can skip the steps above if you have already converted the Wikipedia dump to a gensim corpus.

In [4]:
np.random.seed(42)

In [5]:
project_path = '/mnt/sda2/wiki/'

In [6]:
id2word = gensim.corpora.Dictionary.load_from_text(os.path.join(project_path, '_wordids.txt.bz2'))

In [7]:
id2word[0]

'formação'

In [8]:
# Load the bag-of-words corpus: each document is a list of (word_id, count) tuples
corpus_path = os.path.join(project_path, '_bow.mm')
mm = gensim.corpora.MmCorpus(corpus_path)

2017-05-23 14:08:13,984 : INFO : loaded corpus index from /mnt/sda2/wiki/_bow.mm.index
2017-05-23 14:08:13,985 : INFO : initializing corpus reader from /mnt/sda2/wiki/_bow.mm
2017-05-23 14:08:13,998 : INFO : accepted corpus with 698466 documents, 100000 features, 91887967 non-zero entries


In [9]:
%%time
wiki_tfidf_path = os.path.join(project_path, '_wiki_tfidf_model')
if os.path.exists(wiki_tfidf_path):
    tfidf = models.TfidfModel.load(wiki_tfidf_path)
else:
    tfidf = models.TfidfModel(mm)
    tfidf.save(wiki_tfidf_path)  # persist so the existence check above can skip retraining

2017-05-23 14:08:14,003 : INFO : collecting document frequencies
2017-05-23 14:08:14,007 : INFO : PROGRESS: processing document #0
2017-05-23 14:08:22,375 : INFO : PROGRESS: processing document #10000
2017-05-23 14:08:29,074 : INFO : PROGRESS: processing document #20000
2017-05-23 14:08:36,646 : INFO : PROGRESS: processing document #30000
2017-05-23 14:08:41,255 : INFO : PROGRESS: processing document #40000
2017-05-23 14:08:46,674 : INFO : PROGRESS: processing document #50000
2017-05-23 14:08:49,786 : INFO : PROGRESS: processing document #60000
2017-05-23 14:08:54,292 : INFO : PROGRESS: processing document #70000
2017-05-23 14:08:59,162 : INFO : PROGRESS: processing document #80000
2017-05-23 14:09:03,667 : INFO : PROGRESS: processing document #90000
2017-05-23 14:09:07,459 : INFO : PROGRESS: processing document #100000
2017-05-23 14:09:11,020 : INFO : PROGRESS: processing document #110000
2017-05-23 14:09:14,007 : INFO : PROGRESS: processing document #120000
2017-05-23 14:09:17,974 : 

CPU times: user 3min 34s, sys: 1.09 s, total: 3min 35s
Wall time: 3min 34s


## Calculating the LSI model


In [10]:
def tfidf_corpus_stream(corpus):
    for doc in corpus:
        yield tfidf[doc]

In [11]:
corpus = tfidf_corpus_stream(mm)

In [12]:
wiki_lsi_path = os.path.join(project_path, 'wiki_lsi_model')
if os.path.exists(wiki_lsi_path):
    lsi = gensim.models.LsiModel.load(wiki_lsi_path)
else:
    lsi = gensim.models.lsimodel.LsiModel(corpus=corpus, id2word=id2word, num_topics=10)
    lsi.save(wiki_lsi_path)

2017-05-23 14:11:49,432 : INFO : using serial LSI version on this node
2017-05-23 14:11:49,432 : INFO : updating model with new documents
2017-05-23 14:12:08,992 : INFO : preparing a new chunk of documents
2017-05-23 14:12:09,878 : INFO : using 100 extra samples and 2 power iterations
2017-05-23 14:12:09,879 : INFO : 1st phase: constructing (100000, 110) action matrix
2017-05-23 14:12:10,884 : INFO : orthonormalizing (100000, 110) action matrix
2017-05-23 14:12:20,046 : INFO : 2nd phase: running dense svd on (110, 20000) matrix
2017-05-23 14:12:20,541 : INFO : computing the final decomposition
2017-05-23 14:12:20,542 : INFO : keeping 10 factors (discarding 48.786% of energy spectrum)
2017-05-23 14:12:20,587 : INFO : processed documents up to #20000
2017-05-23 14:12:20,590 : INFO : topic #0(27.472): 0.573*"freguesia" + 0.328*"freguesias" + 0.282*"concelho" + 0.206*"km²" + 0.171*"evolução" + 0.162*"etários" + 0.149*"município" + 0.131*"densidade" + 0.125*"hab" + 0.111*"portuguesa"
2017-0

In [13]:
for n in range(10):
    print("====================")
    print("Topic {}:".format(n))
    print("Coef.\t Token")
    print("--------------------")
    for tok,coef in lsi.show_topic(n):
        print("{:.3}\t{}".format(coef,tok))

Topic 0:
Coef.	 Token
--------------------
0.435	localidades
0.397	km²
0.303	vizinhança
0.297	cobertos
0.215	americano
0.204	condado
0.169	censo
0.153	census
0.152	bureau
0.151	diagrama
Topic 1:
Coef.	 Token
--------------------
0.547	cintura
0.526	asteróide
0.353	orbital
0.208	asteroides
0.184	ua
0.182	excentricidade
0.18	inclinação
0.176	asteróides
0.164	descoberto
0.154	velocidade
Topic 2:
Coef.	 Token
--------------------
0.315	biodiversity
0.305	espécie
0.182	coleópteros
0.157	facility
0.157	ncbi
0.156	consultado
0.155	heritage
0.154	taxonomy
0.151	library
0.149	descritos
Topic 3:
Coef.	 Token
--------------------
0.509	ngc
0.253	constelação
0.251	galáxias
0.244	galáxia
0.133	extragaláctica
0.126	recta
0.126	astronomia
0.125	declinação
0.122	objectos
0.116	catalogue
Topic 4:
Coef.	 Token
--------------------
-0.366	ngc
-0.179	galáxias
-0.176	constelação
-0.175	galáxia
0.148	olímpicos
0.127	jogos
0.126	futebolistas
0.121	futebol
0.106	filmes
-0.0958	extragaláctica
Topic 5:
Coef.	 T

## Hashing the corpus

Before we hash the corpus, we first need to put it back in plain-text form, as it would be at the start of the analysis of a regular corpus. At this point the corpus is in Matrix Market (mm) format, so we define a generator that reads from the mm corpus stream and yields a stream of plain-text documents, each one a list of tokens.

In [14]:
np.random.seed(42)

In [15]:
def plain_from_mm(dictionary, mmcorpus):
    for doc in mmcorpus:
        # The second element of each pair is the term count, which is not needed here
        plain_doc = [dictionary[wid] for wid, _count in doc]
        yield plain_doc

In [16]:
for doc in plain_from_mm(id2word, mm):
    print(doc)
    break

['formação', 'estrelar', 'nuvem', 'magalhães', 'galáxia', 'irregular', 'mosaico', 'nebulosa', 'caranguejo', 'remanescente', 'supernova', 'astronomia', 'ciência', 'natural', 'estuda', 'corpos', 'celestes', 'estrelas', 'planetas', 'cometas', 'nebulosas', 'aglomerados', 'galáxias', 'fenômenos', 'originam', 'fora', 'atmosfera', 'terra', 'radiação', 'cósmica', 'fundo', 'micro', 'ondas', 'preocupada', 'evolução', 'física', 'química', 'movimento', 'objetos', 'desenvolvimento', 'universo', 'antigas', 'ciências', 'culturas', 'pré', 'históricas', 'deixaram', 'registrados', 'vários', 'artefatos', 'astronômicos', 'stonehenge', 'montes', 'menires', 'primeiras', 'civilizações', 'babilônios', 'gregos', 'chineses', 'indianos', 'iranianos', 'maias', 'realizaram', 'observações', 'céu', 'noturno', 'entanto', 'invenção', 'telescópio', 'permitiu', 'moderna', 'historicamente', 'incluiu', 'disciplinas', 'tão', 'diversas', 'astrometria', 'navegação', 'astronômica', 'observacional', 'elaboração', 'calendários'
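
Note that a bag-of-words vector stores one `(word_id, count)` pair per distinct token, so each token shows up only once in the recovered document, however often it occurred in the article. The toy example below illustrates this; the vocabulary, the document and the helper name `tokens_from_bow` are made up for illustration and are not part of the notebook's data.

```python
# Toy illustration: converting a bag-of-words vector back to tokens
# keeps each distinct token once and discards its count.
toy_id2word = {0: "galáxia", 1: "estrela", 2: "planeta"}  # hypothetical vocabulary


def tokens_from_bow(id2word, bow_doc):
    """Return one token per (word_id, count) pair, ignoring the count."""
    return [id2word[word_id] for word_id, _count in bow_doc]


# Imagine "galáxia" occurred three times in the source text.
toy_bow = [(0, 3), (1, 1), (2, 2)]
print(tokens_from_bow(toy_id2word, toy_bow))
```

Because the counts are dropped, the hashed corpus is built from documents in which every term occurs exactly once. This is one plausible contributor to differences between the coefficients of the plain and hashed models.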

In [17]:
pc = plain_from_mm(id2word, mm)

In [18]:
%%time
hashed = CorpusHash(pc, project_path)

2017-05-23 15:07:35,915 - corpushash.hashers - INFO - 698466 documents hashed and saved to /mnt/sda2/wiki/public/2017-05-23_14-21-36-662195.
2017-05-23 15:07:35,915 : INFO : 698466 documents hashed and saved to /mnt/sda2/wiki/public/2017-05-23_14-21-36-662195.


CPU times: user 9min 44s, sys: 25.9 s, total: 10min 10s
Wall time: 45min 59s
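
To give a sense of what this step does to each token, here is a minimal sketch of salted token hashing. It is an assumption about the general technique, not a copy of `corpushash`'s internals: it uses a random salt per distinct token, SHA-256, and base85 encoding (which turns a 32-byte digest into a 40-character string, the same length as the hashed tokens seen in the logs). The function names and the two dictionaries are hypothetical.

```python
import base64
import hashlib
import os


def hash_token(token, salt):
    """Salted SHA-256 digest of a token, encoded as 40 base85 characters."""
    digest = hashlib.sha256(salt + token.encode("utf-8")).digest()
    return base64.b85encode(digest).decode("ascii")


def hash_document(tokens, encode_dict, decode_dict, salt_length=32):
    """Hash every token, drawing one random salt per distinct token."""
    hashed_doc = []
    for token in tokens:
        if token not in encode_dict:
            salt = os.urandom(salt_length)
            hashed = hash_token(token, salt)
            encode_dict[token] = hashed
            # Keep (token, salt) so results can be decoded later.
            decode_dict[hashed] = (token, salt)
        hashed_doc.append(encode_dict[token])
    return hashed_doc


encode_dict, decode_dict = {}, {}
hashed_doc = hash_document(["galáxia", "estrela", "galáxia"], encode_dict, decode_dict)
print(hashed_doc)
print([decode_dict[h][0] for h in hashed_doc])
```

The same token always maps to the same hash within one run, so co-occurrence statistics are preserved, while the decode dictionary stays with the data owner.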


In [19]:
"""#debug
for i, k in zip(id2word, hashed_dictionary):
    if i != hashed.decode_dictionary[k][0]:
        break"""

'#debug\nfor i, k in zip(id2word, hashed_dictionary):\n    if i != hashed.decode_dictionary[k][0]:\n        break'

In [20]:
"""#debug
# they read in the same order, as expected.
superix=20000
for ix, docs in enumerate(zip(plain_from_mm(id2word, mm), hashed.read_hashed_corpus())):
    doc, hashed_doc = docs[0], docs[1]
    if ix == superix:
        for token, hashed_token in zip(doc, hashed_doc):
            print(token, hashed_token, hashed.decode_dictionary[hashed_token][0])
    if ix > superix:
        break"""

'#debug\n# they read in the same order, as expected.\nsuperix=20000\nfor ix, docs in enumerate(zip(plain_from_mm(id2word, mm), hashed.read_hashed_corpus())):\n    doc, hashed_doc = docs[0], docs[1]\n    if ix == superix:\n        for token, hashed_token in zip(doc, hashed_doc):\n            print(token, hashed_token, hashed.decode_dictionary[hashed_token][0])\n    if ix > superix:\n        break'

Now all of the original documents have been hashed, and we can run the same analysis we ran with the plain corpus. 

First let's create the dictionary:

In [21]:
dictionary_path = os.path.join(project_path, 'ptwiki.dict')

In [22]:
%%time
if os.path.exists(dictionary_path):
    dictionary = corpora.Dictionary.load(dictionary_path)
else:
    dictionary = corpora.Dictionary(hashed.read_hashed_corpus())
    dictionary.save(dictionary_path)

2017-05-23 15:07:39,986 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2017-05-23 15:11:27,418 : INFO : adding document #10000 to Dictionary(90687 unique tokens: ['WOpEW4PKUo@378EdZThK=Sh~KHCSXQf|amESmOR3', 'xSGnDd}6-q3_fl(K*3M$fQf(BSIe<|c1CCRO`@RQ', '3i&wj3TB37Z`dpf=nFq2Hs67DN&vohFZBQWKL6`6', '$S@}NLtb_wm&2Co5?jW3yG(r=mK*IHiUpV5;r8C<', '?PFI%o;@tMY_iTd6MIHDB8lTL_Hvmz!x(9$u3UMn']...)
2017-05-23 15:12:33,768 : INFO : adding document #20000 to Dictionary(94596 unique tokens: ['WOpEW4PKUo@378EdZThK=Sh~KHCSXQf|amESmOR3', 'xSGnDd}6-q3_fl(K*3M$fQf(BSIe<|c1CCRO`@RQ', '3i&wj3TB37Z`dpf=nFq2Hs67DN&vohFZBQWKL6`6', '$S@}NLtb_wm&2Co5?jW3yG(r=mK*IHiUpV5;r8C<', '?PFI%o;@tMY_iTd6MIHDB8lTL_Hvmz!x(9$u3UMn']...)
2017-05-23 15:13:38,832 : INFO : adding document #30000 to Dictionary(97829 unique tokens: ['WOpEW4PKUo@378EdZThK=Sh~KHCSXQf|amESmOR3', 'xSGnDd}6-q3_fl(K*3M$fQf(BSIe<|c1CCRO`@RQ', '3i&wj3TB37Z`dpf=nFq2Hs67DN&vohFZBQWKL6`6', '$S@}NLtb_wm&2Co5?jW3yG(r=mK*IHiUpV5;r8C<', '?PFI%o;@tMY_

CPU times: user 3min 17s, sys: 21.5 s, total: 3min 38s
Wall time: 1h 39s


We save the dictionary in case we want to re-run this analysis.

In [23]:
"""#debug
#dictionaries are different, one is sequential while the other is not
for ix, i in enumerate(dictionary):
    if ix>10:
        break
    print(i)"""

'#debug\n#dictionaries are different, one is sequential while the other is not\nfor ix, i in enumerate(dictionary):\n    if ix>10:\n        break\n    print(i)'

Now we create yet another generator, to stream the hashed corpus in the same bag-of-words format as gensim's *mm* corpus:

In [24]:
def corpus_stream(docs, dictio=dictionary):
    for doc in docs:
        yield dictio.doc2bow(doc)

We need to create a fresh document generator and wrap it in the corpus generator:

In [25]:
corpus_s = corpus_stream(hashed.read_hashed_corpus())
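
A generator can only be iterated once, which is why a new one is requested each time the corpus is needed. The sketch below shows the behaviour with a stand-in reader; `read_docs` is a hypothetical function, and we are assuming `read_hashed_corpus()` behaves like a generator function, as its use in this notebook suggests.

```python
def read_docs():
    """Stand-in for a corpus reader: returns a new generator on every call."""
    yield ["a", "b"]
    yield ["c"]


docs = read_docs()
first_pass = list(docs)         # consumes the generator
second_pass = list(docs)        # nothing left to yield
fresh_pass = list(read_docs())  # a new call starts from the beginning

print(len(first_pass), len(second_pass), len(fresh_pass))  # prints: 2 0 2
```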

In [26]:
"""for doc in corpus_s:
    print(doc)
    break"""

'for doc in corpus_s:\n    print(doc)\n    break'

In [27]:
%%time
hashed_wiki_tfidf_path = os.path.join(project_path, 'hashed_wiki_tfidf_model')
if os.path.exists(hashed_wiki_tfidf_path):
    tfidf = models.TfidfModel.load(hashed_wiki_tfidf_path)
else:
    tfidf = models.TfidfModel(corpus_s)

2017-05-23 16:08:21,871 : INFO : collecting document frequencies
2017-05-23 16:08:21,886 : INFO : PROGRESS: processing document #0
2017-05-23 16:09:26,971 : INFO : PROGRESS: processing document #10000
2017-05-23 16:10:32,429 : INFO : PROGRESS: processing document #20000
2017-05-23 16:11:36,343 : INFO : PROGRESS: processing document #30000
2017-05-23 16:12:39,265 : INFO : PROGRESS: processing document #40000
2017-05-23 16:13:39,596 : INFO : PROGRESS: processing document #50000
2017-05-23 16:14:36,455 : INFO : PROGRESS: processing document #60000
2017-05-23 16:15:36,804 : INFO : PROGRESS: processing document #70000
2017-05-23 16:16:38,558 : INFO : PROGRESS: processing document #80000
2017-05-23 16:17:38,536 : INFO : PROGRESS: processing document #90000
2017-05-23 16:18:39,779 : INFO : PROGRESS: processing document #100000
2017-05-23 16:19:35,960 : INFO : PROGRESS: processing document #110000
2017-05-23 16:20:36,937 : INFO : PROGRESS: processing document #120000
2017-05-23 16:21:36,163 : 

CPU times: user 3min 2s, sys: 20.7 s, total: 3min 23s
Wall time: 47min 17s


In [28]:
tfidf.save(hashed_wiki_tfidf_path)

2017-05-23 16:55:39,555 : INFO : saving TfidfModel object under /mnt/sda2/wiki/hashed_wiki_tfidf_model, separately None
2017-05-23 16:55:39,614 : INFO : saved /mnt/sda2/wiki/hashed_wiki_tfidf_model


The next step is to train the LSI model on a TF-IDF-transformed corpus, so we need yet another generator to yield the transformed corpus.

In [29]:
def tfidf_corpus_stream(corpus):
    for doc in corpus:
        yield tfidf[doc]
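
As a rough picture of what `tfidf[doc]` computes, the sketch below reimplements the weighting by hand on a toy corpus. It assumes gensim's default settings as we understand them (raw count times `log2(N / df)`, followed by L2 normalization, with zero-weight terms dropped); the function name and the toy data are hypothetical.

```python
import math


def tfidf_vector(bow_doc, doc_freqs, num_docs):
    """Weight (word_id, count) pairs by count * log2(N / df), then L2-normalize."""
    weighted = [
        (word_id, count * math.log2(num_docs / doc_freqs[word_id]))
        for word_id, count in bow_doc
    ]
    # A term present in every document gets weight 0 and is dropped.
    weighted = [(word_id, w) for word_id, w in weighted if w > 0.0]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    if norm == 0.0:
        return []
    return [(word_id, w / norm) for word_id, w in weighted]


toy_corpus = [
    [(0, 1), (1, 2)],
    [(0, 1), (2, 1)],
    [(0, 3), (1, 1)],
    [(0, 1), (3, 1)],
]

# Document frequency: in how many documents each word id appears.
doc_freqs = {}
for doc in toy_corpus:
    for word_id, _count in doc:
        doc_freqs[word_id] = doc_freqs.get(word_id, 0) + 1

for doc in toy_corpus:
    print(tfidf_vector(doc, doc_freqs, len(toy_corpus)))
```

Word id 0 occurs in all four toy documents, so it carries no weight and disappears from every transformed vector.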

Now we chain the three generators, to ensure we never have to bring the entire corpus into memory.

In [30]:
np.random.seed(42)

In [31]:
corpus_s = corpus_stream(hashed.read_hashed_corpus())
tfidf_corpus_s = tfidf_corpus_stream(corpus_s)
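
The sketch below shows, on toy data, why chaining generators keeps memory use low: building the pipeline reads nothing, and each document is pulled through all stages only when the consumer asks for it. The stage names and the `pulled` list are hypothetical stand-ins for the three generators used here.

```python
pulled = []  # records which documents the source has produced so far


def source():
    """Stage 1: stand-in for reading documents from disk."""
    for i in range(5):
        pulled.append(i)
        yield [i, i + 1]


def to_pairs(docs):
    """Stage 2: stand-in for the bag-of-words conversion."""
    for doc in docs:
        yield [(x, 1) for x in doc]


def scale(docs, factor):
    """Stage 3: stand-in for the TF-IDF transformation."""
    for doc in docs:
        yield [(x, c * factor) for x, c in doc]


pipeline = scale(to_pairs(source()), factor=0.5)
print(pulled)          # still empty: building the chain reads nothing
first = next(pipeline)
print(first, pulled)   # only the first document has been read
```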

In [32]:
%%time
hashed_wiki_lsi_path = os.path.join(project_path, 'hashed_wiki_lsi')
if os.path.exists(hashed_wiki_lsi_path):
    lsih = models.lsimodel.LsiModel.load(hashed_wiki_lsi_path)
else:
    lsih = models.lsimodel.LsiModel(corpus=tfidf_corpus_s, id2word=dictionary, num_topics=10)

2017-05-23 16:55:42,304 : INFO : using serial LSI version on this node
2017-05-23 16:55:42,304 : INFO : updating model with new documents
2017-05-23 16:57:54,906 : INFO : preparing a new chunk of documents
2017-05-23 16:57:55,889 : INFO : using 100 extra samples and 2 power iterations
2017-05-23 16:57:55,889 : INFO : 1st phase: constructing (100000, 110) action matrix
2017-05-23 16:57:56,906 : INFO : orthonormalizing (100000, 110) action matrix
2017-05-23 16:58:06,720 : INFO : 2nd phase: running dense svd on (110, 20000) matrix
2017-05-23 16:58:07,344 : INFO : computing the final decomposition
2017-05-23 16:58:07,345 : INFO : keeping 10 factors (discarding 42.023% of energy spectrum)
2017-05-23 16:58:07,388 : INFO : processed documents up to #20000
2017-05-23 16:58:07,406 : INFO : topic #0(27.831): 0.289*"M{SF6`Fo;SCS<JhAt*F{H(Jf$?N=AwxH3JKy+T4T" + 0.257*"&Z@9s?K$f^2}>G$i4`Iep&k6xar{$5$x%{_-y+<U" + 0.252*"ey0hRW6NvHJJ|x#qWeh%0FZ-I5rTqK_5J_6m@z0e" + 0.219*"^C;WM@o_t?27Jot=SU)_P6R?kBP0Vy

CPU times: user 17min 20s, sys: 1min 51s, total: 19min 12s
Wall time: 45min 46s


In [33]:
lsih.save(hashed_wiki_lsi_path)

2017-05-23 17:41:28,849 : INFO : saving Projection object under /mnt/sda2/wiki/hashed_wiki_lsi.projection, separately None
2017-05-23 17:41:28,999 : INFO : saved /mnt/sda2/wiki/hashed_wiki_lsi.projection
2017-05-23 17:41:29,000 : INFO : saving LsiModel object under /mnt/sda2/wiki/hashed_wiki_lsi, separately None
2017-05-23 17:41:29,000 : INFO : not storing attribute projection
2017-05-23 17:41:29,001 : INFO : not storing attribute dispatcher
2017-05-23 17:41:29,069 : INFO : saved /mnt/sda2/wiki/hashed_wiki_lsi


Let's now look at the topics generated.

In [34]:
for n in range(10):
    print("====================")
    print("Topic {}:".format(n))
    print("Coef.\t Token")
    print("--------------------")
    for tok,coef in lsih.show_topic(n):
        tok = hashed.decode_dictionary[tok.strip()][0]
        print("{:.3}\t{}".format(coef,tok))

Topic 0:
Coef.	 Token
--------------------
0.221	cobertos
0.221	census
0.218	bureau
0.218	vizinhança
0.218	diagrama
0.214	states
0.211	raio
0.205	censo
0.204	demografia
0.203	localidades
Topic 1:
Coef.	 Token
--------------------
0.301	asteroides
0.298	cintura
0.297	excentricidade
0.293	inclinação
0.292	ua
0.289	orbital
0.281	asteróides
0.281	asteróide
0.268	descoberto
0.252	velocidade
Topic 2:
Coef.	 Token
--------------------
0.18	biodiversity
0.18	facility
0.179	ncbi
0.177	heritage
0.177	consultado
0.176	taxonomy
0.172	library
0.169	information
0.168	encyclopedia
0.167	database
Topic 3:
Coef.	 Token
--------------------
0.0544	brasileiro
0.0519	carreira
0.0508	jogos
0.0479	futebol
0.0464	fez
0.0462	paulo
0.0459	tinha
0.0444	atualmente
0.0442	século
0.0441	lugar
Topic 4:
Coef.	 Token
--------------------
-0.248	extragaláctica
-0.245	ngc
-0.242	recta
-0.242	declinação
-0.242	galáxias
-0.236	galáxia
-0.236	constelação
-0.235	objectos
-0.23	astronomia
-0.221	catalogue
Topic 5:
Coef.	 To

In [35]:
lsih.show_topic(0)

[('5fg5GD-Ij2I69<iK6G0?uhBO=j@_Sp2O^17R_gg7', 0.22109996235195337),
 ('o(_LhGbaV}N%8Az@>;4ZC60f{xaQeuvq$!QzjnVG', 0.22076789583387754),
 ('lIADR?ITEeAI|N)VakLKIO`c99TcN!L_3T^A}NRF', 0.21848483280217088),
 ('r0c*|U3p9Rrk=QY#=#-hhC(a)fe{zfzvUa(l@LEV', 0.21833099049349497),
 ('&($@|X=8H`-Ty#1kSWtT>nrw&=omiLiS_E>R*!s6', 0.21804006663288547),
 ('hEn@UcYm05x{=hllb|8HUH_f7Lg#~oQ?%c5IzQQO', 0.21355489189336996),
 ('1(bz=uXxGr3nISH$M(!puP|>kW*LAEa>-;nR^$>u', 0.2107180607988976),
 ('n*Kp}>-h7Ne%N1Vkhtt6wM3?>4A}m95d>f6z-j<e', 0.20523375170376551),
 ('F1PMR=)g;&vXrQa6VkEwrpY9=s9x3?cSUe*;A6lB', 0.20392704866656083),
 ('Wk_LZIY0p<2ftzSs?U#~k8tZ9k9HSSxnU_WVefyy', 0.20284362440182038)]

In [36]:
lsi.show_topic(0)

[('localidades', 0.43450819285995795),
 ('km²', 0.39696094051681735),
 ('vizinhança', 0.30305831803045813),
 ('cobertos', 0.2967907487941337),
 ('americano', 0.21481208674479504),
 ('condado', 0.20443991927578228),
 ('censo', 0.16891831442149335),
 ('census', 0.15348634993193244),
 ('bureau', 0.15169128256176762),
 ('diagrama', 0.15134569416979141)]
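
One simple way to quantify how close the two versions of topic 0 are is the Jaccard overlap of their top-10 token sets. The sketch below uses the plain tokens from `lsi.show_topic(0)` and the decoded tokens printed for topic 0 of the hashed model. The choice of Jaccard overlap is ours, for illustration only, and it ignores the coefficients.

```python
# Top-10 tokens of topic 0, copied from the outputs above.
plain_topic0 = ["localidades", "km²", "vizinhança", "cobertos", "americano",
                "condado", "censo", "census", "bureau", "diagrama"]
hashed_topic0 = ["cobertos", "census", "bureau", "vizinhança", "diagrama",
                 "states", "raio", "censo", "demografia", "localidades"]


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


shared = sorted(set(plain_topic0) & set(hashed_topic0))
print(shared)
print(round(jaccard(plain_topic0, hashed_topic0), 3))  # 7 shared of 13 distinct: 0.538
```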