#### Reference: https://radimrehurek.com/gensim/tutorial.html

In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


In [2]:
import os
import tempfile
TEMP_FOLDER = tempfile.gettempdir()
print('Folder "{}" will be used to save temporary dictionary and corpus.'.format(TEMP_FOLDER))

Folder "C:\Users\nsavi\AppData\Local\Temp" will be used to save temporary dictionary and corpus.


In [3]:
from gensim import corpora


2019-10-21 18:38:16,156 : INFO : 'pattern' package not found; tag filters are not available for English


In [4]:
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",              
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

This is a tiny corpus of nine documents, each consisting of only a single sentence.

First, let's tokenize the documents and remove common words (using a toy stoplist) as well as words that appear only once in the corpus:

In [20]:
from pprint import pprint #pretty-printer
from collections import defaultdict

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
#print(stoplist)

texts = [[word for word in document.lower().split() if word not in stoplist]
        for document in documents
        ]
#print(texts)

# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

        
texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]


pprint(texts)

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]


There are many ways of processing the documents; here we only lowercase each document and split it on whitespace to tokenize.
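
As one example of an alternative, punctuation-aware tokenization can be sketched with a regular expression. This is not what the notebook does; the pattern and the `tokenize` helper are assumptions made for illustration.

```python
import re

def tokenize(document):
    """Lowercase the text and keep runs of ASCII letters or digits as tokens."""
    return re.findall(r"[a-z0-9]+", document.lower())

tokens = tokenize("Graph minors IV: Widths of trees, and well-quasi-ordering")
print(tokens)
# ['graph', 'minors', 'iv', 'widths', 'of', 'trees', 'and', 'well', 'quasi', 'ordering']
```

Unlike a plain `split()`, this drops punctuation attached to words, at the cost of also splitting hyphenated terms.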

To convert documents to vectors, we'll use a document representation called **bag-of-words**.\
In this representation, each document is represented by one vector whose i-th element is the number of times the i-th word appears in the document.
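
As a rough illustration of the idea (a plain-Python sketch, not gensim code), the snippet below builds a dense count vector for one tokenized document. The three-word `vocabulary` and the `dense_bow` helper are made up for this example.

```python
from collections import Counter

# Hypothetical three-word vocabulary; position i in the list is word id i.
vocabulary = ["computer", "human", "system"]

def dense_bow(tokens, vocabulary):
    """Return a list whose i-th element counts how often vocabulary[i] occurs in tokens."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vector = dense_bow(["system", "human", "system", "eps"], vocabulary)
print(vector)  # [0, 1, 2]  ("eps" is not in the vocabulary, so it is ignored)
```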

It is advantageous to represent the words only by their (integer) ids.
The mapping between the words and ids is called a dictionary:


In [26]:
dictionary = corpora.Dictionary(texts)
dictionary.save(os.path.join(TEMP_FOLDER, 'deerwester.dict'))  #store the dictionary, for future reference
print(dictionary)

2019-10-21 20:43:46,632 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-10-21 20:43:46,633 : INFO : built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)
2019-10-21 20:43:46,634 : INFO : saving Dictionary object under C:\Users\nsavi\AppData\Local\Temp\deerwester.dict, separately None
2019-10-21 20:43:46,636 : INFO : saved C:\Users\nsavi\AppData\Local\Temp\deerwester.dict


Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)


We assigned a unique integer ID to every word appearing in the processed corpus with the gensim.corpora.dictionary.Dictionary class.\
This sweeps across the texts, collecting word counts and relevant statistics.
In the end we see 12 distinct words in the processed corpus, which means each document will be represented by twelve numbers (i.e., by a 12-D vector). To see the mapping between words and their ids:

In [39]:
print(dictionary.token2id)

{'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}


In [28]:
print(dictionary.id2token)

{}
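
The empty dictionary is expected rather than an error: gensim appears to fill `id2token` lazily, the first time a token is looked up by id (for example via `dictionary[0]`). Independently of that behaviour, the reverse mapping can be derived from `token2id` with plain Python. The sketch below copies the `token2id` output printed above; `id2token` here is an ordinary local dict, not the gensim attribute.

```python
# token2id as printed above (copied by hand, not read from gensim)
token2id = {'computer': 0, 'human': 1, 'interface': 2, 'response': 3,
            'survey': 4, 'system': 5, 'time': 6, 'user': 7,
            'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}

# Invert the mapping: ids become keys, tokens become values.
id2token = {token_id: token for token, token_id in token2id.items()}

print(id2token[5])  # system
```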


In [29]:
print(dictionary.cfs)

{1: 2, 2: 2, 0: 2, 4: 2, 7: 3, 5: 4, 3: 2, 6: 2, 8: 2, 9: 3, 10: 3, 11: 2}


In [30]:
print(dictionary.dfs)

{1: 2, 2: 2, 0: 2, 4: 2, 7: 3, 5: 3, 3: 2, 6: 2, 8: 2, 9: 3, 10: 3, 11: 2}


In [31]:
print(dictionary.num_docs)

9


In [32]:
print(dictionary.num_pos)

29


In [33]:
print(dictionary.num_nnz)

28
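
The attributes printed above are corpus statistics. A plain-Python sketch (not gensim's implementation) that recomputes them from the processed `texts` makes their meaning concrete. The local names `cfs`, `dfs`, `num_docs`, `num_pos` and `num_nnz` simply mirror the attribute names, and the counts are keyed by token rather than by id.

```python
from collections import Counter

# Processed corpus, copied from the pprint output above.
texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]

cfs = Counter()  # collection frequency: total occurrences of each token
dfs = Counter()  # document frequency: number of documents containing each token
for text in texts:
    cfs.update(text)
    dfs.update(set(text))

num_docs = len(texts)                            # documents processed
num_pos = sum(len(text) for text in texts)       # total tokens (corpus positions)
num_nnz = sum(len(set(text)) for text in texts)  # non-zero entries in the bag-of-words matrix

print(cfs['system'], dfs['system'])  # 4 3
print(num_docs, num_pos, num_nnz)    # 9 29 28
```

"system" occurs twice in the fourth document, which is why its collection frequency (4) exceeds its document frequency (3), and why `num_pos` is one larger than `num_nnz`.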


To actually convert tokenized documents to vectors:

In [40]:
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)

[(0, 1), (1, 1)]


The function doc2bow() simply counts the number of occurrences of each distinct word, converts the word to its integer word id and returns the result as a bag-of-words: a sparse vector in the form of [(word_id, word_count), ...].

As the token id is 0 for "computer" and 1 for "human", the new document "Human computer interaction" is transformed to [(0, 1), (1, 1)]. The words "computer" and "human" exist in the dictionary and appear once each, so they become (0, 1) and (1, 1) respectively in the sparse vector. The word "interaction" doesn't exist in the dictionary and thus does not show up in the sparse vector. The other ten dictionary words, which appear (implicitly) zero times, are also left out, so the sparse vector never contains an element like (3, 0).
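
The behaviour described here can be mimicked in a few lines of plain Python. This is only a sketch of the idea, not gensim's code; `doc2bow_sketch` is a made-up name and `token2id` is copied by hand from the output printed earlier.

```python
from collections import Counter

# token2id as printed earlier in this notebook (copied by hand)
token2id = {'computer': 0, 'human': 1, 'interface': 2, 'response': 3,
            'survey': 4, 'system': 5, 'time': 6, 'user': 7,
            'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}

def doc2bow_sketch(tokens, token2id):
    """Count tokens known to the dictionary and return sorted (word_id, word_count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

new_vec = doc2bow_sketch("Human computer interaction".lower().split(), token2id)
print(new_vec)  # [(0, 1), (1, 1)]
```

Unknown words such as "interaction" are skipped by the `if t in token2id` filter, and words with a zero count never enter the `Counter`, so no `(id, 0)` pairs are produced.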



In [41]:
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize(os.path.join(TEMP_FOLDER, 'deerwester.mm'), corpus)
for c in corpus:
    print(c)

2019-10-21 21:29:58,390 : INFO : storing corpus in Matrix Market format to C:\Users\nsavi\AppData\Local\Temp\deerwester.mm
2019-10-21 21:29:58,392 : INFO : saving sparse matrix to C:\Users\nsavi\AppData\Local\Temp\deerwester.mm
2019-10-21 21:29:58,394 : INFO : PROGRESS: saving document #0
2019-10-21 21:29:58,395 : INFO : saved 9x12 matrix, density=25.926% (28/108)
2019-10-21 21:29:58,397 : INFO : saving MmCorpus index to C:\Users\nsavi\AppData\Local\Temp\deerwester.mm.index


[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]
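
The `density=25.926% (28/108)` figure in the log above can be checked by hand: 28 non-zero entries out of 9 × 12 cells. The sketch below is plain Python with a made-up helper name (`to_dense`). It recomputes the density from the sparse vectors printed above and expands one of them into its dense 12-D form.

```python
# Sparse corpus, copied from the output above.
corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]
num_terms = 12  # size of the dictionary

def to_dense(sparse_vec, num_terms):
    """Expand [(word_id, count), ...] into a list of length num_terms."""
    dense = [0] * num_terms
    for word_id, count in sparse_vec:
        dense[word_id] = count
    return dense

nnz = sum(len(vec) for vec in corpus)       # stored (non-zero) entries
density = nnz / (len(corpus) * num_terms)   # fraction of cells that are non-zero

print(to_dense(corpus[3], num_terms))  # [0, 1, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0]
print(nnz, round(100 * density, 3))    # 28 25.926
```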


## Corpus Streaming: One Document at a Time

Note that the corpus above resides fully in memory, as a plain Python list. In this simple example it doesn't matter much, but to make things clear, let's assume there are millions of documents in the corpus. Storing all of them in RAM won't do. Instead, let's assume the documents are stored in a file on disk, one document per line.\
Gensim only requires that a corpus be able to return one document vector at a time:

In [6]:
from smart_open import open  # smart_open.open replaces the deprecated smart_open() function

class MyCorpus(object):
    def __iter__(self):
        for line in open('datasets/mycorpus.txt', 'rb'):
            # assume there is one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())

The assumption that each document occupies one line in a single file is not important; you can design the `__iter__` function to fit your input format.
Just parse your input to retrieve a clean list of tokens in each document, then convert the tokens via the dictionary to their IDs and yield the resulting sparse vector inside `__iter__`.
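
The streaming pattern can be tried without gensim or a pre-existing input file. The following is a self-contained sketch under stated assumptions: `StreamedCorpus` is a made-up class, the small `token2id` dict stands in for a gensim dictionary, and the two-line corpus is written to a temporary directory rather than to `datasets/mycorpus.txt`.

```python
import os
import tempfile
from collections import Counter

# Hypothetical stand-in for a gensim dictionary: a plain token -> id mapping.
token2id = {'computer': 0, 'human': 1, 'interface': 2}

class StreamedCorpus:
    """Yield one sparse vector per line of a text file, holding only one line in memory."""

    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id

    def __iter__(self):
        with open(self.path, encoding='utf-8') as handle:
            for line in handle:  # one document per line
                counts = Counter(self.token2id[t] for t in line.lower().split()
                                 if t in self.token2id)
                yield sorted(counts.items())

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a two-line corpus so the example does not depend on any existing file.
    path = os.path.join(tmp_dir, 'mycorpus.txt')
    with open(path, 'w', encoding='utf-8') as handle:
        handle.write("Human machine interface\nHuman computer interaction human\n")

    streamed = StreamedCorpus(path, token2id)
    vectors = list(streamed)  # each iteration re-opens and re-reads the file

print(vectors)  # [[(1, 1), (2, 1)], [(0, 1), (1, 2)]]
```

Because `__iter__` opens the file anew on every call, the corpus can be iterated over several times, which multi-pass algorithms need.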

In [7]:
corpus_memory_friendly = MyCorpus()  # doesn't load the corpus into memory!
print(corpus_memory_friendly)

<__main__.MyCorpus object at 0x0547D650>


`corpus_memory_friendly` is now an object. We didn't define any way to `print` it, so `print` just outputs the address of the object in memory. Not very useful. To see the constituent vectors, let's iterate over the corpus and print each document vector.

In [8]:
for vector in corpus_memory_friendly: # Load one vector into memory at a time
    print(vector)


FileNotFoundError: [Errno 2] No such file or directory: 'datasets/mycorpus.txt'