## Corpora and Vector Spaces

#### Reference : https://radimrehurek.com/gensim/tutorial.html

In [12]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


In [13]:
import os
import tempfile
TEMP_FOLDER = tempfile.gettempdir()
print('Folder "{}" will be used to save temporary dictionary and corpus.'.format(TEMP_FOLDER))

Folder "C:\Users\nsavi\AppData\Local\Temp" will be used to save temporary dictionary and corpus.


In [14]:
from gensim import corpora


2019-10-22 15:26:29,883 : INFO : 'pattern' package not found; tag filters are not available for English


In [15]:
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",              
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

This is a tiny corpus of nine documents, each consisting of only a single sentence.

First, let's tokenize the documents and remove common words (using a toy stoplist) as well as words that appear only once in the corpus:

In [16]:
from pprint import pprint #pretty-printer
from collections import defaultdict

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
#print(stoplist)

texts = [[word for word in document.lower().split() if word not in stoplist]
        for document in documents
        ]
#print(texts)

# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

        
texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]


pprint(texts)

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]


There are many ways of processing the documents; here we only lowercase each document and split it on whitespace to tokenize.

To convert documents to vectors, we'll use a document representation called **bag-of-words**. \
In this representation, each document is represented by one vector in which element i is the number of times the i-th dictionary word appears in the document.
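To make the representation concrete, here is a minimal pure-Python sketch of a dense bag-of-words vector. The helper `dense_bag_of_words` and the three-word vocabulary are hypothetical and used only for illustration; they are not part of gensim.

```python
from collections import Counter


def dense_bag_of_words(tokens, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)
    # Counter returns 0 for words that never occur in the document.
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary: position i of the vector belongs to word i.
vocabulary = ["computer", "human", "system"]
tokens = "system and human system engineering".split()

dense_vector = dense_bag_of_words(tokens, vocabulary)
print(dense_vector)  # [0, 1, 2]
```

Words outside the vocabulary ("and", "engineering") are simply ignored, and words that are absent from the document get an explicit zero.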

It is advantageous to represent the words only by their (integer) ids.
The mapping between the words and their ids is called a dictionary:


In [17]:
dictionary = corpora.Dictionary(texts)
dictionary.save(os.path.join(TEMP_FOLDER, 'deerwester.dict'))  #store the dictionary, for future reference
print(dictionary)

2019-10-22 15:26:29,913 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-10-22 15:26:29,914 : INFO : built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)
2019-10-22 15:26:29,915 : INFO : saving Dictionary object under C:\Users\nsavi\AppData\Local\Temp\deerwester.dict, separately None
2019-10-22 15:26:29,917 : INFO : saved C:\Users\nsavi\AppData\Local\Temp\deerwester.dict


Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)


We assigned a unique integer ID to every word appearing in the processed corpus with the `gensim.corpora.dictionary.Dictionary` class.\
This sweeps across the texts, collecting word counts and relevant statistics.
In the end we see 12 distinct words in the processed corpus, which means each document will be represented by twelve numbers (i.e., by a 12-D vector). To see the mapping between words and their ids:

In [18]:
print(dictionary.token2id)

{'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}


In [19]:
print(dictionary.id2token)

{}


In [20]:
print(dictionary.cfs)

{1: 2, 2: 2, 0: 2, 4: 2, 7: 3, 5: 4, 3: 2, 6: 2, 8: 2, 9: 3, 10: 3, 11: 2}


In [21]:
print(dictionary.dfs)

{1: 2, 2: 2, 0: 2, 4: 2, 7: 3, 5: 3, 3: 2, 6: 2, 8: 2, 9: 3, 10: 3, 11: 2}


In [22]:
print(dictionary.num_docs)

9


In [23]:
print(dictionary.num_pos)

29


In [24]:
print(dictionary.num_nnz)

28
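The attributes printed above are corpus statistics: `cfs` holds collection frequencies (total occurrences of each token), `dfs` holds document frequencies (number of documents containing each token), `num_pos` is the total number of processed tokens, and `num_nnz` is the number of non-zero entries in the bag-of-words matrix. The sketch below recomputes them in plain Python from the processed `texts`. It keys the counters by token rather than by integer id, which is a simplification for readability.

```python
from collections import Counter

# The processed texts from the notebook (stop words and one-off words removed).
texts = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey'],
]

collection_freq = Counter()  # analogous to dictionary.cfs
document_freq = Counter()    # analogous to dictionary.dfs
for text in texts:
    collection_freq.update(text)     # every occurrence counts
    document_freq.update(set(text))  # each document counts at most once

num_docs = len(texts)
num_pos = sum(len(text) for text in texts)       # all token positions
num_nnz = sum(len(set(text)) for text in texts)  # distinct tokens per document

print(num_docs, num_pos, num_nnz)                          # 9 29 28
print(collection_freq['system'], document_freq['system'])  # 4 3
```

The one-unit gap between `num_pos` and `num_nnz` comes from "system", which occurs twice in the fourth document.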


To actually convert tokenized documents to vectors:

In [25]:
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)

[(0, 1), (1, 1)]


The function `doc2bow()` simply counts the number of occurrences of each distinct word, converts the word to its integer word id and returns the result as a bag-of-words: a sparse vector in the form `[(word_id, word_count), ...]`.

As the token id is 0 for "computer" and 1 for "human", the new document "Human computer interaction" is transformed to `[(0, 1), (1, 1)]`. The words "computer" and "human" exist in the dictionary and each appears once, so they become (0, 1) and (1, 1) respectively in the sparse vector. The word "interaction" doesn't exist in the dictionary and thus does not show up in the sparse vector. The other ten dictionary words, which appear (implicitly) zero times, do not show up either, so the sparse vector never contains an element like (3, 0).
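The behaviour described above can be imitated in a few lines of plain Python. This is only a sketch of the idea, not gensim's implementation; `doc2bow_sketch` is a hypothetical name, and the mapping is a three-word excerpt of the `token2id` dictionary printed earlier.

```python
from collections import Counter

# Excerpt of the token -> id mapping shown by dictionary.token2id.
token2id = {'computer': 0, 'human': 1, 'interface': 2}


def doc2bow_sketch(tokens, token2id):
    """Count known tokens and return sorted (word_id, word_count) pairs."""
    counts = Counter(tok for tok in tokens if tok in token2id)  # unknown words are dropped
    return sorted((token2id[tok], n) for tok, n in counts.items())


sparse_vec = doc2bow_sketch("human computer interaction".split(), token2id)
print(sparse_vec)  # [(0, 1), (1, 1)]
```

Because only counted tokens are emitted, zero counts never appear, which is what makes the vector sparse.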



In [26]:
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize(os.path.join(TEMP_FOLDER, 'deerwester.mm'), corpus)
for c in corpus:
    print(c)

2019-10-22 15:26:29,991 : INFO : storing corpus in Matrix Market format to C:\Users\nsavi\AppData\Local\Temp\deerwester.mm
2019-10-22 15:26:29,994 : INFO : saving sparse matrix to C:\Users\nsavi\AppData\Local\Temp\deerwester.mm
2019-10-22 15:26:29,994 : INFO : PROGRESS: saving document #0
2019-10-22 15:26:29,996 : INFO : saved 9x12 matrix, density=25.926% (28/108)
2019-10-22 15:26:29,997 : INFO : saving MmCorpus index to C:\Users\nsavi\AppData\Local\Temp\deerwester.mm.index


[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]


## Corpus Streaming: One Document at a Time

Note that the corpus above resides fully in memory, as a plain Python list. In this simple example it doesn't matter much, but to make things clear, let's assume there are millions of documents in the corpus. Storing all of them in RAM won't do. Instead, let's assume the documents are stored in a file on disk, one document per line.\
Gensim only requires that a corpus be able to return one document vector at a time:

In [27]:
from smart_open import smart_open
class MyCorpus(object):
    def __iter__(self):
        for line in smart_open('datasets/mycorpus.txt', 'rb'):
            # assume there is one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())

The assumption that each document occupies one line in a single file is not important; you can design the `__iter__` function to fit your input format.
Just parse your input to retrieve a clean list of tokens in each document, then convert the tokens via the dictionary to their IDs and yield the resulting sparse vector inside `__iter__`.
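As a self-contained illustration of this pattern, the sketch below streams sparse vectors from a temporary text file using only the standard library. The class name `StreamingCorpus`, the three-word `token2id` mapping and the file contents are assumptions made for the example. The point it shows is that a class with `__iter__` can be iterated repeatedly, whereas a bare generator would be exhausted after one pass.

```python
import os
import tempfile
from collections import Counter


class StreamingCorpus:
    """Yield one sparse bag-of-words vector per line of a text file."""

    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id

    def __iter__(self):
        # The file is reopened on every pass, so only one line is held in memory
        # at a time and the corpus can be iterated more than once.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                counts = Counter(tok for tok in line.lower().split() if tok in self.token2id)
                yield sorted((self.token2id[tok], n) for tok, n in counts.items())


# Hypothetical mapping and a two-document file, one document per line.
token2id = {"computer": 0, "human": 1, "system": 2}
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Human computer interaction\nSystem and human system engineering\n")

stream = StreamingCorpus(tmp.name, token2id)
first_pass = list(stream)
second_pass = list(stream)  # a second pass works because __iter__ restarts the read
os.remove(tmp.name)

print(first_pass)  # [[(0, 1), (1, 1)], [(1, 1), (2, 2)]]
```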

In [28]:
corpus_memory_friendly = MyCorpus()  # doesn't load the corpus into memory!
print(corpus_memory_friendly)

<__main__.MyCorpus object at 0x109111D0>


`corpus_memory_friendly` is now an object. We didn't define any way to `print` it, so `print` just outputs the address of the object in memory. Not very useful. To see the constituent vectors, let's iterate over the corpus and print each document vector.

In [29]:
for vector in corpus_memory_friendly: # Load one vector into memory at a time
    print(vector)

[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]



Although the output is the same as for the plain Python list, the corpus is now much more memory friendly, because at most one vector resides in RAM at a time. The corpus can now be as large as you want.

In [32]:
from six import iteritems
from smart_open import smart_open

#collect statistics about all tokens
dictionary = corpora.Dictionary(line.lower().split() for line in smart_open('datasets/mycorpus.txt', 'rb'))
# print(dictionary)
# print(type(dictionary))
# collect the ids of stop words and of words that appear in only one document
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist 
           if stopword in dictionary.token2id]
once_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1]

# remove stop words and words that appear only once
dictionary.filter_tokens(stop_ids + once_ids)
print(dictionary)

2019-10-22 16:05:03,583 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-10-22 16:05:03,586 : INFO : built Dictionary(42 unique tokens: ['abc', 'applications', 'computer', 'for', 'human']...) from 9 documents (total 69 corpus positions)


Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)


### Corpus Formats

There exist several file formats for serializing a Vector Space corpus (a sequence of vectors) to disk. Gensim implements them via the *streaming corpus interface* mentioned earlier: documents are read from (or stored to) disk in a lazy fashion, one document at a time, without the whole corpus being read into main memory at once.

One of the more notable file formats is the __Matrix Market format__. To save a corpus in the Matrix Market format:

In [42]:
# create a toy corpus of 2 documents, as a plain Python list
corpus = [[(1, 0.5)], []] # make one document empty, for the heck of it

corpora.MmCorpus.serialize(os.path.join(TEMP_FOLDER, 'corpus.mm'), corpus)

2019-10-22 17:10:21,323 : INFO : storing corpus in Matrix Market format to C:\Users\nsavi\AppData\Local\Temp\corpus.mm
2019-10-22 17:10:21,325 : INFO : saving sparse matrix to C:\Users\nsavi\AppData\Local\Temp\corpus.mm
2019-10-22 17:10:21,326 : INFO : PROGRESS: saving document #0
2019-10-22 17:10:21,328 : INFO : saved 2x2 matrix, density=25.000% (1/4)
2019-10-22 17:10:21,330 : INFO : saving MmCorpus index to C:\Users\nsavi\AppData\Local\Temp\corpus.mm.index
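To give a feel for what a Matrix Market file contains, the sketch below renders the same toy corpus in the coordinate layout: a banner line, a size line (`documents terms non-zeros`), then one `row column value` triple per non-zero entry with 1-based indices. The helper `to_matrix_market` is hypothetical, and the assumption that documents are written as rows with this exact banner and number formatting has not been checked against the file gensim produces.

```python
def to_matrix_market(corpus):
    """Render a sparse corpus as Matrix Market coordinate text (documents as rows)."""
    docs = [list(doc) for doc in corpus]
    num_docs = len(docs)
    # Highest term id + 1; an all-empty corpus has zero terms.
    num_terms = 1 + max((term_id for doc in docs for term_id, _ in doc), default=-1)
    num_nnz = sum(len(doc) for doc in docs)

    lines = [
        "%%MatrixMarket matrix coordinate real general",
        f"{num_docs} {num_terms} {num_nnz}",
    ]
    for doc_no, doc in enumerate(docs, start=1):
        for term_id, weight in sorted(doc):
            lines.append(f"{doc_no} {term_id + 1} {weight}")  # indices are 1-based
    return "\n".join(lines) + "\n"


mm_text = to_matrix_market([[(1, 0.5)], []])
print(mm_text)
```

For the two-document toy corpus the size line reads `2 2 1`, which agrees with the "2x2 matrix" and "(1/4)" density reported in the log output above.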


In [43]:
corpora.SvmLightCorpus.serialize(os.path.join(TEMP_FOLDER, 'corpus.svmlight'), corpus)
corpora.BleiCorpus.serialize(os.path.join(TEMP_FOLDER, 'corpus.lda-c'), corpus)
corpora.LowCorpus.serialize(os.path.join(TEMP_FOLDER, 'corpus.low'), corpus)

2019-10-22 17:10:23,356 : INFO : converting corpus to SVMlight format: C:\Users\nsavi\AppData\Local\Temp\corpus.svmlight
2019-10-22 17:10:23,358 : INFO : saving SvmLightCorpus index to C:\Users\nsavi\AppData\Local\Temp\corpus.svmlight.index
2019-10-22 17:10:23,360 : INFO : no word id mapping provided; initializing from corpus
2019-10-22 17:10:23,361 : INFO : storing corpus in Blei's LDA-C format into C:\Users\nsavi\AppData\Local\Temp\corpus.lda-c
2019-10-22 17:10:23,364 : INFO : saving vocabulary of 2 words to C:\Users\nsavi\AppData\Local\Temp\corpus.lda-c.vocab
2019-10-22 17:10:23,366 : INFO : saving BleiCorpus index to C:\Users\nsavi\AppData\Local\Temp\corpus.lda-c.index
2019-10-22 17:10:23,368 : INFO : no word id mapping provided; initializing from corpus
2019-10-22 17:10:23,369 : INFO : storing corpus in List-Of-Words format into C:\Users\nsavi\AppData\Local\Temp\corpus.low
2019-10-22 17:10:23,372 : INFO : saving LowCorpus index to C:\Users\nsavi\AppData\Local\Temp\corpus.low.index

Conversely, to load a corpus iterator from a Matrix Market file:

In [44]:
corpus = corpora.MmCorpus(os.path.join(TEMP_FOLDER, 'corpus.mm'))

2019-10-22 17:10:27,779 : INFO : loaded corpus index from C:\Users\nsavi\AppData\Local\Temp\corpus.mm.index
2019-10-22 17:10:27,780 : INFO : initializing corpus reader from C:\Users\nsavi\AppData\Local\Temp\corpus.mm
2019-10-22 17:10:27,782 : INFO : accepted corpus with 2 documents, 2 features, 1 non-zero entries


In [45]:
print(corpus)

MmCorpus(2 documents, 2 features, 1 non-zero entries)


In [46]:
# one way of printing a corpus: load it entirely into memory
print(list(corpus))  # calling list() will convert any sequence to a plain Python list

[[(1, 0.5)], []]


In [None]:
# another way of doing it: print one document at a time, making use of the streaming interface
for doc in corpus:
    print(doc)

### Compatibility with NumPy and SciPy

Gensim also contains efficient utility functions to help convert from/to NumPy matrices:

In [50]:
import gensim
import numpy as np
numpy_matrix = np.random.randint(10, size=[5,2])
print(numpy_matrix)
corpus = gensim.matutils.Dense2Corpus(numpy_matrix)
#print(corpus)
numpy_matrix_dense = gensim.matutils.corpus2dense(corpus, num_terms=10)
print(numpy_matrix_dense)

[[8 1]
 [1 6]
 [5 6]
 [3 1]
 [9 7]]
[[8. 1.]
 [1. 6.]
 [5. 6.]
 [3. 1.]
 [9. 7.]
 [0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]]
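The printed result has ten rows because `num_terms=10` was requested while the input had only five, and the missing terms are filled with zeros. The pure-Python sketch below mirrors that round trip under the convention that each column of the dense matrix is one document (which is `Dense2Corpus`'s default as far as I know). The helper names `dense_to_corpus` and `corpus_to_dense` are hypothetical stand-ins, not gensim functions.

```python
def dense_to_corpus(matrix):
    """Treat each column as a document and yield its non-zero (term_id, weight) pairs."""
    num_terms = len(matrix)
    num_docs = len(matrix[0]) if matrix else 0
    for doc_no in range(num_docs):
        yield [
            (term_id, float(matrix[term_id][doc_no]))
            for term_id in range(num_terms)
            if matrix[term_id][doc_no] != 0
        ]


def corpus_to_dense(corpus, num_terms):
    """Rebuild a num_terms x num_docs matrix, filling unseen terms with zeros."""
    columns = []
    for doc in corpus:
        column = [0.0] * num_terms
        for term_id, weight in doc:
            column[term_id] = weight
        columns.append(column)
    # Transpose the list of document columns back into term rows.
    return [[column[term_id] for column in columns] for term_id in range(num_terms)]


matrix = [[8, 1], [1, 6], [0, 6]]  # 3 terms x 2 documents
toy_corpus = list(dense_to_corpus(matrix))
padded = corpus_to_dense(toy_corpus, num_terms=5)

print(toy_corpus)  # [[(0, 8.0), (1, 1.0)], [(0, 1.0), (1, 6.0), (2, 6.0)]]
print(padded)      # [[8.0, 1.0], [1.0, 6.0], [0.0, 6.0], [0.0, 0.0], [0.0, 0.0]]
```

Passing the true number of terms (3 here) would reproduce the original matrix without padding.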


and from/to `scipy.sparse` matrices:

In [56]:
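import scipy.sparse
scipy_sparse_matrix = scipy.sparse.random(5, 2)  # random sparse matrix as an example
print(scipy_sparse_matrix)
corpus = gensim.matutils.Sparse2Corpus(scipy_sparse_matrix)  # wrap the sparse matrix itself, not the dense one
scipy_csc_matrix = gensim.matutils.corpus2csc(corpus)
print(scipy_csc_matrix)  # print the converted matrix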





## Topics and Transformations

In [None]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Transforming documents from one vector representation into another serves two goals:

1. To bring out hidden structure in the corpus, discover relationships between words and use them to describe the documents in a new and (hopefully) more semantic way.
2. To make the document representation more compact. This improves both efficiency (the new representation consumes fewer resources) and efficacy (marginal data trends are ignored, which reduces noise).

### Creating the Corpus

First, we need to create a corpus to work with.
This step is the same as in the previous tutorial;
if you completed it, feel free to skip to the next section.

In [57]:
from collections import defaultdict
from gensim import corpora

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]


In [75]:
# remove common words and tokenize

stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]
texts

[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
 ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'management', 'system'],
 ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
 ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'trees'],
 ['intersection', 'graph', 'paths', 'trees'],
 ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
 ['graph', 'minors', 'survey']]

In [78]:
# remove words that appear only once

frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

        

texts = [
       [token for token in text if frequency[token] > 1]
       for text in texts
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
#print(corpus)

2019-10-22 19:13:01,791 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-10-22 19:13:01,792 : INFO : built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)


### Creating a transformation

The transformations are standard Python objects, typically initialized by means of a *training corpus*:

In [79]:
from gensim import models

tfidf = models.TfidfModel(corpus)  #step 1-- initialize a model



2019-10-22 19:46:46,263 : INFO : collecting document frequencies
2019-10-22 19:46:46,264 : INFO : PROGRESS: processing document #0
2019-10-22 19:46:46,265 : INFO : calculating IDF weights for 9 documents and 12 features (28 matrix non-zeros)


We used our old corpus from tutorial 1 to initialise the transformation model. Different transformations may require different initialisation parameters; in the case of TfIdf, the "training" consists simply of going through the supplied corpus once and computing document frequencies of all its features. Training other models, such as Latent Semantic Analysis or Latent Dirichlet Allocation, is much more involved and, consequently, takes much more time.
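The "training" pass for TfIdf can be sketched in plain Python. The version below assumes the weighting scheme that matches the numbers in this notebook: raw term frequency times `log2(num_docs / document_frequency)`, followed by normalisation to unit length. The helper names `train_idf` and `apply_tfidf` are hypothetical; this is an illustration of the idea rather than gensim's code.

```python
import math

# Bag-of-words corpus from tutorial 1 (one list of (term_id, count) pairs per document).
bow_corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]


def train_idf(corpus):
    """One pass over the corpus: count document frequencies, then derive IDF weights."""
    num_docs = 0
    doc_freq = {}
    for doc in corpus:
        num_docs += 1
        for term_id, _ in doc:
            doc_freq[term_id] = doc_freq.get(term_id, 0) + 1
    return {term_id: math.log2(num_docs / df) for term_id, df in doc_freq.items()}


def apply_tfidf(idf, doc):
    """Weight each count by its IDF and scale the vector to unit Euclidean length."""
    weighted = [(term_id, tf * idf[term_id]) for term_id, tf in doc if idf.get(term_id, 0.0) != 0.0]
    norm = math.sqrt(sum(weight * weight for _, weight in weighted))
    return [(term_id, weight / norm) for term_id, weight in weighted] if norm else []


idf = train_idf(bow_corpus)
tfidf_query = apply_tfidf(idf, [(0, 1), (1, 1)])
print(tfidf_query)  # both weights are 1/sqrt(2), roughly 0.7071
```

Terms 0 and 1 each occur in two of the nine documents, so they receive the same IDF and the normalised vector splits its weight evenly between them.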

### Transforming vectors

From now on, `tfidf` is treated as a read-only object that can be used to convert any vector from the old representation (bag-of-words integer counts) to the new representation (TfIdf real-valued weights):

In [80]:
doc_bow = [(0,1), (1,1)]
print(tfidf[doc_bow])   # step-2-- use the model to transform vectors

[(0, 0.7071067811865476), (1, 0.7071067811865476)]


In [81]:
# Or to apply a transformation to a whole corpus

corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)

[(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]
[(0, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.44424552527467476), (5, 0.3244870206138555), (6, 0.44424552527467476), (7, 0.3244870206138555)]
[(2, 0.5710059809418182), (5, 0.4170757362022777), (7, 0.4170757362022777), (8, 0.5710059809418182)]
[(1, 0.49182558987264147), (5, 0.7184811607083769), (8, 0.49182558987264147)]
[(3, 0.6282580468670046), (6, 0.6282580468670046), (7, 0.45889394536615247)]
[(9, 1.0)]
[(9, 0.7071067811865475), (10, 0.7071067811865475)]
[(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]
[(4, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]


In this particular case, we are transforming the same corpus that we used for training, but this is only incidental. Once the transformation model has been initialised, it can be used on any vectors (provided they come from the same vector space, of course), even if they were not used in the training corpus at all. This is achieved by a process called folding-in for LSA, by topic inference for LDA, etc.

Transformations can also be serialised, one on top of another, in a sort of chain:

In [85]:
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)   # initialize an LSI transformation
print(lsi)

corpus_lsi = lsi[corpus_tfidf]  # create a double wrapper over the original corpus
print(corpus_lsi)

2019-10-22 20:41:29,776 : INFO : using serial LSI version on this node
2019-10-22 20:41:29,777 : INFO : updating model with new documents
2019-10-22 20:41:29,778 : INFO : preparing a new chunk of documents
2019-10-22 20:41:29,779 : INFO : using 100 extra samples and 2 power iterations
2019-10-22 20:41:29,780 : INFO : 1st phase: constructing (12, 102) action matrix
2019-10-22 20:41:29,781 : INFO : orthonormalizing (12, 102) action matrix
2019-10-22 20:41:29,784 : INFO : 2nd phase: running dense svd on (12, 9) matrix
2019-10-22 20:41:29,785 : INFO : computing the final decomposition
2019-10-22 20:41:29,786 : INFO : keeping 2 factors (discarding 47.565% of energy spectrum)
2019-10-22 20:41:29,787 : INFO : processed documents up to #9
2019-10-22 20:41:29,788 : INFO : topic #0(1.594): 0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"time" + 0.060*"response" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"
2019-10-22 20:41:29,789 : INFO : topic #

LsiModel(num_terms=12, num_topics=2, decay=1.0, chunksize=20000)
<gensim.interfaces.TransformedCorpus object at 0x109B01F0>


Here we transformed our Tf-Idf corpus via Latent Semantic Indexing into a latent 2-D space. Now you're probably wondering: what do these two latent dimensions stand for? Let's inspect with `models.LsiModel.print_topics()`:



In [86]:
lsi.print_topics(2)

2019-10-22 20:45:17,458 : INFO : topic #0(1.594): 0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"time" + 0.060*"response" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"
2019-10-22 20:45:17,459 : INFO : topic #1(1.476): -0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"time" + -0.320*"response" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"


[(0,
  '0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"time" + 0.060*"response" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"'),
 (1,
  '-0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"time" + -0.320*"response" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"')]

It appears that, according to LSI, "trees", "graph" and "minors" are all related words (and contribute most to the direction of the first topic), while the second topic practically concerns itself with all the other words. As expected, the first five documents are more strongly related to the second topic, while the remaining four documents are more strongly related to the first topic.

Both the bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly:

In [87]:
for doc, as_text in zip(corpus_lsi, documents):
    print(doc, as_text)

[(0, 0.06600783396090704), (1, -0.5200703306361848)] Human machine interface for lab abc computer applications
[(0, 0.19667592859142993), (1, -0.7609563167700033)] A survey of user opinion of computer system response time
[(0, 0.0899263997244685), (1, -0.7241860626752508)] The EPS user interface management system
[(0, 0.07585847652178529), (1, -0.6320551586003424)] System and human system engineering testing of EPS
[(0, 0.10150299184980505), (1, -0.5737308483002945)] Relation of user perceived response time to error measurement
[(0, 0.7032108939378302), (1, 0.16115180214026184)] The generation of random binary unordered trees
[(0, 0.8774787673119822), (1, 0.16758906864659903)] The intersection graph of paths in trees
[(0, 0.9098624686818569), (1, 0.14086553628719517)] Graph minors IV Widths of trees and well quasi ordering
[(0, 0.6165825350569283), (1, -0.05392907566389018)] Graph minors A survey
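For intuition about what the two coordinates per document mean, here is a minimal NumPy sketch of LSI as a truncated singular value decomposition. It runs on a small made-up term-document matrix (the values are purely illustrative) and uses an exact dense SVD. Gensim's `LsiModel` works on streamed data with a randomized algorithm, so this shows the underlying linear algebra under simplifying assumptions, not gensim's procedure.

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents (illustrative values).
term_doc = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])

num_topics = 2
U, S, Vt = np.linalg.svd(term_doc, full_matrices=False)

topics = U[:, :num_topics]        # each column holds the term weights of one latent topic
doc_coords = topics.T @ term_doc  # project every document into the 2-D latent space

print(doc_coords.shape)  # (2, 4)
print(doc_coords.T)      # one (topic 0, topic 1) coordinate pair per document
```

The sign of each singular vector is arbitrary, which is why a topic can consist mostly of negative weights, as the second topic above does, without changing its meaning.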


In [89]:
# Model persistency is achieved with the save() and load() functions:

lsi.save('model.lsi')  # same for tfidf, lda, ...
lsi = models.LsiModel.load('model.lsi')

2019-10-22 21:00:42,386 : INFO : saving Projection object under model.lsi.projection, separately None
2019-10-22 21:00:42,389 : INFO : saved model.lsi.projection
2019-10-22 21:00:42,391 : INFO : saving LsiModel object under model.lsi, separately None
2019-10-22 21:00:42,392 : INFO : not storing attribute projection
2019-10-22 21:00:42,392 : INFO : not storing attribute dispatcher
2019-10-22 21:00:42,395 : INFO : saved model.lsi
2019-10-22 21:00:42,396 : INFO : loading LsiModel object from model.lsi
2019-10-22 21:00:42,398 : INFO : loading id2word recursively from model.lsi.id2word.* with mmap=None
2019-10-22 21:00:42,400 : INFO : setting ignored attribute projection to None
2019-10-22 21:00:42,401 : INFO : setting ignored attribute dispatcher to None
2019-10-22 21:00:42,401 : INFO : loaded model.lsi
2019-10-22 21:00:42,402 : INFO : loading LsiModel object from model.lsi.projection
2019-10-22 21:00:42,404 : INFO : loaded model.lsi.projection


The next question might be: just how exactly similar are those documents to each other?
Is there a way to formalise the similarity, so that for a given input document, we can order some other set of documents according to their similarity?
Similarity queries are covered in the next tutorial.

#### Available Transformations

1. Term Frequency-Inverse Document Frequency (Tf-Idf)
2. Latent Semantic Indexing (LSI)


## Similarity Queries

This section demonstrates querying a corpus for similar documents.

In [91]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

#### Creating the Corpus

First, we need to create a corpus to work with.
This step is the same as in the previous tutorial;
if you completed it, feel free to skip to the next section.

In [92]:
from collections import defaultdict
from gensim import corpora

In [93]:
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

In [94]:
# remove common words and tokenize

stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

In [95]:
# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

2019-10-22 21:19:04,056 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-10-22 21:19:04,058 : INFO : built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)


#### Similarity interface

We covered what it means to create a corpus in the Vector Space Model and how to transform it between different vector spaces.
A common reason for such a charade is that we want to determine the **similarity between pairs of documents**, or the **similarity between a specific document and a set of other documents**.

In [96]:
from gensim import models
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

2019-10-22 21:23:16,442 : INFO : using serial LSI version on this node
2019-10-22 21:23:16,443 : INFO : updating model with new documents
2019-10-22 21:23:16,444 : INFO : preparing a new chunk of documents
2019-10-22 21:23:16,445 : INFO : using 100 extra samples and 2 power iterations
2019-10-22 21:23:16,447 : INFO : 1st phase: constructing (12, 102) action matrix
2019-10-22 21:23:16,448 : INFO : orthonormalizing (12, 102) action matrix
2019-10-22 21:23:16,451 : INFO : 2nd phase: running dense svd on (12, 9) matrix
2019-10-22 21:23:16,452 : INFO : computing the final decomposition
2019-10-22 21:23:16,452 : INFO : keeping 2 factors (discarding 43.156% of energy spectrum)
2019-10-22 21:23:16,453 : INFO : processed documents up to #9
2019-10-22 21:23:16,454 : INFO : topic #0(3.341): 0.644*"system" + 0.404*"user" + 0.301*"eps" + 0.265*"response" + 0.265*"time" + 0.240*"computer" + 0.221*"human" + 0.206*"survey" + 0.198*"interface" + 0.036*"graph"
2019-10-22 21:23:16,455 : INFO : topic #1(2

Now suppose a user typed in the query `"Human computer interaction"`. We would
 like to sort our nine corpus documents in decreasing order of relevance to this query.
 Unlike modern search engines, here we only concentrate on a single aspect of possible
 similarities---on apparent semantic relatedness of their texts (words). No hyperlinks,
 no random-walk static ranks, just a semantic extension over the boolean keyword match:

In [97]:
doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]  # convert the query to LSI space
print(vec_lsi)


[(0, 0.4618210045327159), (1, -0.07002766527899912)]


In [98]:
from gensim import similarities
index = similarities.MatrixSimilarity(lsi[corpus])  # transform corpus to LSI space and index it


2019-10-22 21:24:53,655 : INFO : creating matrix with 9 documents and 2 features


In [100]:
index.save('deerwester.index')
index = similarities.MatrixSimilarity.load('deerwester.index')


2019-10-22 21:26:05,597 : INFO : saving MatrixSimilarity object under deerwester.index, separately None
2019-10-22 21:26:05,599 : INFO : saved deerwester.index
2019-10-22 21:26:05,601 : INFO : loading MatrixSimilarity object from deerwester.index
2019-10-22 21:26:05,603 : INFO : loaded deerwester.index


In [101]:
# Performing queries
# ++++++++++++++++++
#
# To obtain similarities of our query document against the nine indexed documents:

sims = index[vec_lsi]  # perform a similarity query against the corpus
print(list(enumerate(sims)))  # print (document_number, document_similarity) 2-tuples

[(0, 0.998093), (1, 0.93748635), (2, 0.9984453), (3, 0.9865886), (4, 0.90755945), (5, -0.12416792), (6, -0.10639259), (7, -0.09879464), (8, 0.050041765)]
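The scores above are cosine similarities, which range from -1 (opposite direction) to 1 (same direction). The sketch below computes the measure in plain Python for sparse vectors and ranks a few documents against a query. The function name `cosine_similarity` and the 2-D vectors are made up for illustration; they are not the LSI vectors of this corpus.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors given as (id, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(weight * b.get(idx, 0.0) for idx, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)


# Illustrative query and documents in a 2-D latent space.
query = [(0, 1.0)]
indexed_docs = {
    "same_direction": [(0, 2.0)],
    "orthogonal": [(1, 3.0)],
    "opposite": [(0, -1.0)],
}

scores = {name: cosine_similarity(query, vec) for name, vec in indexed_docs.items()}
ranking = sorted(scores.items(), key=lambda item: -item[1])
print(ranking)  # [('same_direction', 1.0), ('orthogonal', 0.0), ('opposite', -1.0)]
```

Only the direction of a vector matters for this measure, so a document twice as long as the query in the same direction still scores 1.0.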


In [102]:
sims = sorted(enumerate(sims), key=lambda item: -item[1])
for doc_position, doc_score in sims:
    # look up each document by its own index, not by its rank in the sorted list
    print((doc_position, doc_score), documents[doc_position])

(2, 0.9984453) The EPS user interface management system
(0, 0.998093) Human machine interface for lab abc computer applications
(3, 0.9865886) System and human system engineering testing of EPS
(1, 0.93748635) A survey of user opinion of computer system response time
(4, 0.90755945) Relation of user perceived response time to error measurement
(8, 0.050041765) Graph minors A survey
(7, -0.09879464) Graph minors IV Widths of trees and well quasi ordering
(6, -0.10639259) The intersection graph of paths in trees
(5, -0.12416792) The generation of random binary unordered trees
