<h1>Corpora and Vector Spaces</h1>

In [1]:
import gensim
from gensim import corpora
from pprint import pprint
from gensim.utils import simple_preprocess
from smart_open import smart_open
import os

In [2]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

<h2>From strings to vectors</h2>





In [20]:
#documents represented as strings
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

The first step is to tokenize the documents, remove common words using a (toy) stoplist, and remove words that appear only once in the corpus.

In [4]:
from pprint import pprint #prettyprinter
from collections import defaultdict

#remove stopwords and tokenize

stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

#remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

pprint(texts)

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]


To convert the documents to vectors, we use the bag-of-words representation. Each document is represented by one vector, and each vector element is a question-answer pair such as "How many times does the word system appear in the document? Once."

Note: it is advantageous to represent the questions only by their (integer) ids. The mapping between the questions and ids is called a dictionary.

In [5]:
from gensim import corpora
dictionary = corpora.Dictionary(texts)
dictionary.save('deerwester.dict')  # store the dictionary for future reference
print(dictionary)

2021-10-26 21:59:09,893 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2021-10-26 21:59:09,898 : INFO : built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)
2021-10-26 21:59:09,903 : INFO : Dictionary lifecycle event {'msg': "built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)", 'datetime': '2021-10-26T21:59:09.901191', 'gensim': '4.1.2', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.18362-SP0', 'event': 'created'}
2021-10-26 21:59:09,905 : INFO : Dictionary lifecycle event {'fname_or_handle': 'deerwester.dict', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2021-10-26T21:59:09.905334', 'gensim': '4.1.2', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Wind

Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)


In the code above, the gensim Dictionary class assigns a unique integer id to each word that appears in the corpus.

There are 12 unique tokens in the corpus, so each document will be represented by 12 numbers, i.e. a 12-dimensional vector.
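
The id assignment described above can be approximated in plain Python. This is only a sketch, not gensim's implementation: the helper name `build_token2id` is hypothetical, and it assumes ids are handed out document by document, with the new tokens of each document taken in alphabetical order. The `texts` list is copied from the preprocessing output earlier in this notebook.

```python
# Tokenized corpus, copied from the preprocessing output above.
texts = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey'],
]


def build_token2id(tokenized_docs):
    """Assign one integer id per distinct token (hypothetical helper).

    Assumption: within each document, unseen tokens receive the next free
    ids in alphabetical order.
    """
    token2id = {}
    for doc in tokenized_docs:
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


token2id_sketch = build_token2id(texts)
print(token2id_sketch)
print(len(token2id_sketch))  # number of dimensions of each document vector
```

For this toy corpus the sketch yields 12 ids, one per dimension of the document vectors.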

In [6]:
# let's see the mapping between the words and their ids
print(dictionary.token2id)

{'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}


In [7]:
# let's tokenize a new document and convert it into a vector

new_doc = "Computer Human Interface Response"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)

[(0, 1), (1, 1), (2, 1), (3, 1)]


The vector above can be read as a series of question-answer pairs, one per tuple.

The first tuple, (0, 1), encodes the question "How many times does token 0 (computer) appear in the document?" and the answer "1", because the word Computer appears once in the new document. The remaining tuples encode similar pairs: there is one instance each of the tokens 'human', 'interface', and 'response'.<br>

<b>Note:</b> The function 'doc2bow' counts the number of occurrences of each distinct word, converts each word to its integer id, and returns the result as a sparse vector. Words that are not in the dictionary are ignored.
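
To make the counting step concrete, here is a plain-Python sketch of what a bag-of-words conversion does. It is an illustration under stated assumptions, not gensim's code: `doc2bow_sketch` is a hypothetical helper, and the small `token2id` dict stands in for the first four entries of the real dictionary.

```python
from collections import Counter

# Hypothetical stand-in for the dictionary's token -> id mapping (first four ids only).
token2id = {'computer': 0, 'human': 1, 'interface': 2, 'response': 3}


def doc2bow_sketch(tokens, token2id):
    """Return sorted (token_id, count) pairs; tokens missing from token2id are skipped."""
    counts = Counter(token for token in tokens if token in token2id)
    return sorted((token2id[token], n) for token, n in counts.items())


bow = doc2bow_sketch("Computer Human Interface Response".lower().split(), token2id)
print(bow)

# 'interaction' is not in the mapping, so it does not appear in the result.
bow_with_unknown = doc2bow_sketch("human computer interaction human".split(), token2id)
print(bow_with_unknown)
```

The second call shows both effects at once: a repeated word raises the count, and an unknown word drops out of the vector.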

In [8]:
# convert every text to a bag-of-words vector and store the corpus on disk for later use
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('deerwester.mm', corpus) 
print(corpus)

2021-10-26 21:59:09,955 : INFO : storing corpus in Matrix Market format to deerwester.mm
2021-10-26 21:59:09,958 : INFO : saving sparse matrix to deerwester.mm
2021-10-26 21:59:09,960 : INFO : PROGRESS: saving document #0
2021-10-26 21:59:09,961 : INFO : saved 9x12 matrix, density=25.926% (28/108)
2021-10-26 21:59:09,962 : INFO : saving MmCorpus index to deerwester.mm.index


[[(0, 1), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(2, 1), (5, 1), (7, 1), (8, 1)], [(1, 1), (5, 2), (8, 1)], [(3, 1), (6, 1), (7, 1)], [(9, 1)], [(9, 1), (10, 1)], [(9, 1), (10, 1), (11, 1)], [(4, 1), (10, 1), (11, 1)]]
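
The corpus printed above is sparse: each document lists only the ids with a nonzero count. The sketch below expands each document into a full 12-dimensional vector and recomputes the density figure reported in the log. `to_dense` is a hypothetical helper written for this illustration; gensim provides its own utility for this, `gensim.matutils.sparse2full`, which is not used here so that the example runs without gensim.

```python
# Sparse bag-of-words corpus, copied from the output above.
corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]
num_terms = 12  # size of the dictionary


def to_dense(bow, num_terms):
    """Expand (token_id, count) pairs into a list of length num_terms (hypothetical helper)."""
    dense = [0] * num_terms
    for token_id, count in bow:
        dense[token_id] = count
    return dense


dense_corpus = [to_dense(bow, num_terms) for bow in corpus]
print(dense_corpus[3])  # 'system' (id 5) appears twice in the fourth document

# Density = stored (nonzero) entries divided by all cells of the 9x12 matrix.
nonzero = sum(len(bow) for bow in corpus)
total_cells = len(corpus) * num_terms
density = nonzero / total_cells
print(f"{nonzero}/{total_cells} = {density:.3%}")
```

Only about a quarter of the cells are nonzero even in this tiny corpus, which is why the sparse form is the default representation.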


<h2>Corpus streaming: one document at a time</h2>

With toy examples it is acceptable to hold the whole corpus in memory as a plain Python list, as above. With a large number of documents, however, the corpus may no longer fit in RAM.
In that case it is preferable to store the documents in a file and read them into gensim one document at a time.

In [9]:
# build the dictionary straight from the file; the with-block closes the file afterwards
with open("mycorpus.txt", encoding="utf-8") as f:
    dict_STF = corpora.Dictionary(simple_preprocess(line, deacc=True) for line in f)

2021-10-26 21:59:09,981 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2021-10-26 21:59:09,984 : INFO : built Dictionary(41 unique tokens: ['abc', 'applications', 'computer', 'for', 'human']...) from 9 documents (total 67 corpus positions)
2021-10-26 21:59:09,986 : INFO : Dictionary lifecycle event {'msg': "built Dictionary(41 unique tokens: ['abc', 'applications', 'computer', 'for', 'human']...) from 9 documents (total 67 corpus positions)", 'datetime': '2021-10-26T21:59:09.986535', 'gensim': '4.1.2', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.18362-SP0', 'event': 'created'}


In [10]:
print(dict_STF)

Dictionary(41 unique tokens: ['abc', 'applications', 'computer', 'for', 'human']...)


In [11]:
print(dict_STF.token2id)

{'abc': 0, 'applications': 1, 'computer': 2, 'for': 3, 'human': 4, 'interface': 5, 'lab': 6, 'machine': 7, 'of': 8, 'opinion': 9, 'response': 10, 'survey': 11, 'system': 12, 'time': 13, 'user': 14, 'eps': 15, 'management': 16, 'the': 17, 'and': 18, 'engineering': 19, 'testing': 20, 'error': 21, 'measurement': 22, 'perceived': 23, 'relation': 24, 'to': 25, 'binary': 26, 'generation': 27, 'random': 28, 'trees': 29, 'unordered': 30, 'graph': 31, 'in': 32, 'intersection': 33, 'paths': 34, 'iv': 35, 'minors': 36, 'ordering': 37, 'quasi': 38, 'well': 39, 'widths': 40}


In [12]:
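class MyCorpus:
    def __iter__(self):
        # re-open the file on every pass so the corpus can be iterated more than once
        with open('mycorpus.txt', encoding='utf-8') as f:
            for line in f:
                # assume there's one document per line, tokens separated by whitespace
                yield dictionary.doc2bow(line.lower().split())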

In [13]:
corpus_memory_friendly = MyCorpus() #corpus is now an object

In [14]:
print(corpus_memory_friendly)

<__main__.MyCorpus object at 0x0000022DE26215E0>


In [16]:
# the corpus object is memory friendly: at most one vector resides in RAM at a time
for vector in corpus_memory_friendly:
    print(vector)

[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]
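
One reason to wrap the stream in a class with `__iter__`, rather than pass around a bare generator, is that code which needs more than one pass over the corpus can simply iterate again. The sketch below shows the difference with a hypothetical two-line file written to a temporary location; `LineCorpus` and `line_generator` are illustrative names, not part of gensim.

```python
import os
import tempfile

# Hypothetical two-line corpus file, written to a temporary location for the demo.
tmp = tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8')
tmp.write("human computer interface\ngraph minors survey\n")
tmp.close()


class LineCorpus:
    """Re-opens the file on every iteration, so it can be traversed repeatedly."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding='utf-8') as f:
            for line in f:
                yield line.lower().split()


def line_generator(path):
    """A bare generator: exhausted after a single pass."""
    with open(path, encoding='utf-8') as f:
        for line in f:
            yield line.lower().split()


corpus_obj = LineCorpus(tmp.name)
gen = line_generator(tmp.name)

# Count the documents seen on a first and a second pass.
class_passes = [len(list(corpus_obj)), len(list(corpus_obj))]
gen_passes = [len(list(gen)), len(list(gen))]
print(class_passes)  # [2, 2]
print(gen_passes)    # [2, 0]

os.remove(tmp.name)
```

In both cases only one line is held in memory at a time; the class differs only in that every `for` loop over it starts a fresh read of the file.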


In [18]:
# it is also possible to construct the dictionary without loading all texts into memory
with open('mycorpus.txt', encoding='utf-8') as f:
    dictionary = corpora.Dictionary(line.lower().split() for line in f)

# ids of the stop words that made it into the dictionary
stop_ids = [
    dictionary.token2id[stopword]
    for stopword in stoplist
    if stopword in dictionary.token2id
]

# ids of the words that appear in only one document (dfs holds document frequencies)
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]
dictionary.filter_tokens(stop_ids + once_ids)  # remove stop words and words that appear in only one document
dictionary.compactify()  # remove gaps in the id sequence left by the removed words
print(dictionary)
print(dictionary)

2021-10-26 22:07:43,807 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2021-10-26 22:07:43,810 : INFO : built Dictionary(42 unique tokens: ['abc', 'applications', 'computer', 'for', 'human']...) from 9 documents (total 69 corpus positions)
2021-10-26 22:07:43,811 : INFO : Dictionary lifecycle event {'msg': "built Dictionary(42 unique tokens: ['abc', 'applications', 'computer', 'for', 'human']...) from 9 documents (total 69 corpus positions)", 'datetime': '2021-10-26T22:07:43.811296', 'gensim': '4.1.2', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.18362-SP0', 'event': 'created'}


Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)
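
The filter-then-compact step can be sketched in plain Python on a hypothetical three-document corpus. This illustrates the idea only and is not gensim's implementation. It assumes, as in the cell above, that the frequency being tested is a document frequency (the number of documents containing a token), and that compaction renumbers the surviving tokens in order of their old ids.

```python
from collections import Counter

# Hypothetical mini-corpus, one tokenized document per entry.
docs = [
    "a survey of user opinion".split(),
    "graph minors a survey".split(),
    "the graph of trees".split(),
]
stoplist = {'a', 'of', 'the'}

# Pass 1: assign ids and count document frequencies, one document at a time.
token2id = {}
dfs = Counter()
for doc in docs:
    for token in sorted(set(doc)):
        token_id = token2id.setdefault(token, len(token2id))
        dfs[token_id] += 1

# Collect the ids to drop: stop words, and tokens seen in only one document.
stop_ids = {token2id[t] for t in stoplist if t in token2id}
once_ids = {token_id for token_id, df in dfs.items() if df == 1}
bad_ids = stop_ids | once_ids

# Filter, then renumber the survivors in order of their old ids to close the gaps.
kept = sorted((i, t) for t, i in token2id.items() if i not in bad_ids)
compact_token2id = {token: new_id for new_id, (_, token) in enumerate(kept)}
print(compact_token2id)
```

Here only 'survey' and 'graph' occur in more than one document and are not stop words, so they end up with the consecutive ids 0 and 1.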
