# A vector space model implementation using NLTK (Natural Language Toolkit) and Gensim

We first import the NLTK tokenizer that splits text into words and punctuation marks.

In [1]:
from nltk.tokenize import wordpunct_tokenize

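Next, we import NLTK's stopword lists, which we use to filter out English stopwords. If the stopword corpus is not yet installed locally, calling nltk.download('stopwords') once fetches it.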

In [2]:
from nltk.corpus import stopwords

Now we import the class that implements Porter's stemming algorithm.

In [3]:
from nltk.stem import PorterStemmer

Now we provide a sample document collection, containing 9 (very brief) documents.

In [6]:
sample_corpus = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey"
]

The first step is to preprocess each document in the collection. We write a function that receives a document as a string and returns a list with the STEMS of all tokens in that document that are longer than 2 characters and are NOT (English) stopwords.

In [4]:
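def preprocess_document(doc):
    stopset = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    # Split the text into word and punctuation tokens.
    tokens = wordpunct_tokenize(doc)
    # Lowercase, then drop stopwords and tokens of 2 characters or fewer.
    clean = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2]
    # Reduce each remaining word to its Porter stem.
    final = [stemmer.stem(word) for word in clean]
    return final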

Let us now, for instance, preprocess the second document in the collection (index 1).

In [7]:
print(preprocess_document(sample_corpus[1]))

['survey', 'user', 'opinion', 'comput', 'system', 'respons', 'time']
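
To see what each step of the function contributes, the tokenizing and filtering stages can be imitated with the standard library alone. This is only a sketch under stated assumptions: the regular expression stands in for wordpunct_tokenize, TINY_STOPSET is a hypothetical, hand-picked subset of the English stopword list, and stemming is left out entirely, so the result contains full words rather than stems.

```python
import re

# Hypothetical, hand-picked stand-in for stopwords.words('english').
TINY_STOPSET = {"a", "and", "for", "in", "of", "the", "to"}

def tokenize_and_filter(doc):
    # Split into runs of word characters or runs of punctuation
    # (assumed here to approximate wordpunct_tokenize).
    tokens = re.findall(r"\w+|[^\w\s]+", doc)
    # Same filter as preprocess_document, minus the stemming step.
    return [t.lower() for t in tokens
            if t.lower() not in TINY_STOPSET and len(t) > 2]

filtered = tokenize_and_filter(
    "A survey of user opinion of computer system response time")
print(filtered)
```

Comparing the printed list with the output above shows what the stemmer adds: 'computer' becomes 'comput' and 'response' becomes 'respons'.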


Once all documents in the collection have been preprocessed, we need to create a dictionary that maps each word identifier to its word (WORD_ID -> WORD). This dictionary is required to create the vector-based document representations.

In [8]:
from gensim import corpora

def create_dictionary(docs):
    pdocs = [preprocess_document(doc) for doc in docs]
    dictionary = corpora.Dictionary(pdocs)
    dictionary.save('/tmp/vsm.dict')
    return dictionary

Let us call the create_dictionary function on the complete corpus. Note that the function also saves the generated dictionary to disk (here, to /tmp/vsm.dict), so it can be reloaded later if required.

In [9]:
dict = create_dictionary(sample_corpus)
print(dict)

Dictionary(34 unique tokens: ['abc', 'applic', 'comput', 'human', 'interfac']...)
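
As a rough illustration of what such a dictionary holds, the mapping can be sketched in plain Python. The function build_token2id below is hypothetical and is not Gensim's implementation; visiting the new tokens of each document in sorted order is an assumption chosen because it reproduces the identifiers seen in this notebook for the first two documents.

```python
def build_token2id(tokenized_docs):
    """Assign consecutive integer ids to tokens in order of first appearance."""
    token2id = {}
    for tokens in tokenized_docs:
        # Assumption: within a document, unseen tokens get ids in sorted order.
        for token in sorted(set(tokens)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

# Stems of the first two documents of the sample corpus.
stemmed_docs = [
    ["human", "machin", "interfac", "lab", "abc", "comput", "applic"],
    ["survey", "user", "opinion", "comput", "system", "respons", "time"],
]

token2id = build_token2id(stemmed_docs)
# Invert the mapping to obtain WORD_ID -> WORD.
id2token = {token_id: token for token, token_id in token2id.items()}
print(token2id["comput"], id2token[7])
```

A word shared by several documents, such as 'comput', keeps the identifier it received the first time it was seen.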


We have now built the dictionary containing the vocabulary that we will use for indexing. Next, we write a function that creates the bag-of-words (BOW) representation of each document in the collection.

In [10]:
def docs2bows(corpus, dictionary):
    docs = [preprocess_document(d) for d in corpus]
    vectors = [dictionary.doc2bow(doc) for doc in docs]
    corpora.MmCorpus.serialize('/tmp/vsm_docs.mm', vectors)
    return vectors

Note that the function also saves the generated BOW-based corpus to disk in Matrix Market format, using the corpora module imported earlier from Gensim. Let us now generate the BOWs for the complete corpus.

In [12]:
bows = docs2bows(sample_corpus, dict)
print(bows)

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)], [(2, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)], [(4, 1), (10, 1), (12, 1), (13, 1), (14, 1)], [(3, 1), (10, 2), (13, 1), (15, 1), (16, 1)], [(8, 1), (11, 1), (12, 1), (17, 1), (18, 1), (19, 1), (20, 1)], [(21, 1), (22, 1), (23, 1), (24, 1), (25, 1)], [(24, 1), (26, 1), (27, 1), (28, 1)], [(24, 1), (26, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1)], [(9, 1), (26, 1), (29, 1)]]


These are (word identifier, frequency) pairs, one list per document. Let us now convert them into something a bit more readable.

In [13]:
for v in bows:
    # Use word_id rather than id to avoid shadowing the built-in id().
    tvec = [(dict[word_id], freq) for (word_id, freq) in v]
    print(tvec)

[('abc', 1), ('applic', 1), ('comput', 1), ('human', 1), ('interfac', 1), ('lab', 1), ('machin', 1)]
[('comput', 1), ('opinion', 1), ('respons', 1), ('survey', 1), ('system', 1), ('time', 1), ('user', 1)]
[('interfac', 1), ('system', 1), ('user', 1), ('ep', 1), ('manag', 1)]
[('human', 1), ('system', 2), ('ep', 1), ('engin', 1), ('test', 1)]
[('respons', 1), ('time', 1), ('user', 1), ('error', 1), ('measur', 1), ('perceiv', 1), ('relat', 1)]
[('binari', 1), ('gener', 1), ('random', 1), ('tree', 1), ('unord', 1)]
[('tree', 1), ('graph', 1), ('intersect', 1), ('path', 1)]
[('tree', 1), ('graph', 1), ('minor', 1), ('order', 1), ('quasi', 1), ('well', 1), ('width', 1)]
[('survey', 1), ('graph', 1), ('minor', 1)]
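
The (word identifier, frequency) pairs can also be reproduced by hand. The following sketch counts the stems of the fourth document with collections.Counter; to_bow is a hypothetical helper, the identifiers are copied from the output above, and skipping unknown words is an assumption about how out-of-vocabulary tokens should be treated.

```python
from collections import Counter

def to_bow(tokens, token2id):
    """Return sorted (word_id, count) pairs for the tokens found in token2id."""
    # Assumption: tokens missing from the vocabulary are silently skipped.
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Identifiers copied from the notebook output for the fourth document.
token2id = {"human": 3, "system": 10, "ep": 13, "engin": 15, "test": 16}
# Stems of "System and human system engineering testing of EPS".
stems = ["system", "human", "system", "engin", "test", "ep"]

bow = to_bow(stems, token2id)
print(bow)
```

The stem 'system' occurs twice in that document, which is why identifier 10 carries a frequency of 2.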


These are basically TF-weighted vectors, that is, raw term counts. We now want to convert them into their TF-IDF weighted counterparts, for which we need to import the models module from Gensim.

In [14]:
from gensim import models

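def create_TF_IDF_model(corpus):
    dictionary = create_dictionary(corpus)
    # Called for its side effect: writes the BOW corpus to /tmp/vsm_docs.mm.
    docs2bows(corpus, dictionary)
    loaded_corpus = corpora.MmCorpus('/tmp/vsm_docs.mm')
    # Learn the IDF weights from the serialized BOW corpus.
    tfidf = models.TfidfModel(loaded_corpus)
    return tfidf, dictionary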

Let us now create the TF-IDF model.

In [16]:
tfidfm = create_TF_IDF_model(sample_corpus)
print(tfidfm)

(<gensim.models.tfidfmodel.TfidfModel object at 0x7f38f0681b38>, <gensim.corpora.dictionary.Dictionary object at 0x7f38f0674e10>)


As can be seen, the function returns a tuple that contains the TF-IDF model and the associated dictionary. Let us now take a closer look at the TF-IDF model itself, which is the first element of the tuple.

In [17]:
print(tfidfm[0].__dict__)

{'id2word': None, 'wlocal': <function identity at 0x7f38f1b77048>, 'wglobal': <function df2idf at 0x7f38f1446840>, 'normalize': True, 'num_docs': 9, 'num_nnz': 50, 'idfs': {0: 3.1699250014423126, 1: 3.1699250014423126, 2: 2.1699250014423126, 3: 2.1699250014423126, 4: 2.1699250014423126, 5: 3.1699250014423126, 6: 3.1699250014423126, 7: 3.1699250014423126, 8: 2.1699250014423126, 9: 2.1699250014423126, 10: 1.5849625007211563, 11: 2.1699250014423126, 12: 1.5849625007211563, 13: 2.1699250014423126, 14: 3.1699250014423126, 15: 3.1699250014423126, 16: 3.1699250014423126, 17: 3.1699250014423126, 18: 3.1699250014423126, 19: 3.1699250014423126, 20: 3.1699250014423126, 21: 3.1699250014423126, 22: 3.1699250014423126, 23: 3.1699250014423126, 24: 1.5849625007211563, 25: 3.1699250014423126, 26: 1.5849625007211563, 27: 3.1699250014423126, 28: 3.1699250014423126, 29: 2.1699250014423126, 30: 3.1699250014423126, 31: 3.1699250014423126, 32: 3.1699250014423126, 33: 3.1699250014423126}, 'smartirs': None, 's
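
The 'idfs' values printed above are consistent with the formula log2(num_docs / df), where df is the number of documents containing the word. For example, identifier 10 ('system') appears in 3 of the 9 documents, and log2(9 / 3) is about 1.585. The sketch below recomputes these weights with the standard library. The helper names are hypothetical, and scaling the weighted vector to unit Euclidean length is our reading of 'normalize': True rather than something the output states explicitly.

```python
import math

NUM_DOCS = 9  # 'num_docs' in the output above

def idf(doc_freq, num_docs=NUM_DOCS):
    """Inverse document frequency, base 2, matching the printed 'idfs'."""
    return math.log2(num_docs / doc_freq)

def tfidf_unit_vector(bow, doc_freqs):
    """Weight each (word_id, tf) pair by idf, then scale to unit length."""
    weighted = [(word_id, tf * idf(doc_freqs[word_id])) for word_id, tf in bow]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(word_id, w / norm) for word_id, w in weighted]

# Document frequencies counted by hand from the BOW output shown earlier.
doc_freqs = {3: 2, 10: 3, 13: 2, 15: 1, 16: 1}
# BOW vector of the fourth document.
vec = tfidf_unit_vector([(3, 1), (10, 2), (13, 1), (15, 1), (16, 1)], doc_freqs)

print(idf(1), idf(2), idf(3))
print(vec)
```

Words that occur in a single document receive the largest weight, while words shared by three documents, such as 'system', are discounted.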

Finally, we create a function that, given the corpus and a user-provided query, prints a document ranking sorted in descending order of relevance (according to the cosine measure).

In [18]:
from operator import itemgetter
from gensim import similarities

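def launch_query(corpus, q, filename='/tmp/vsm_docs.mm'):
    # Build the TF-IDF model; this also writes the BOW corpus to /tmp/vsm_docs.mm.
    tfidf, dictionary = create_TF_IDF_model(corpus)
    loaded_corpus = corpora.MmCorpus(filename)
    # Index the BOW (raw count) document vectors for cosine similarity lookups.
    index = similarities.MatrixSimilarity(loaded_corpus, num_features=len(dictionary))
    # Preprocess the query like a document and weight it with TF-IDF.
    pq = preprocess_document(q)
    vq = dictionary.doc2bow(pq)
    qtfidf = tfidf[vq]
    sim = index[qtfidf]
    # Rank documents by decreasing similarity to the query.
    ranking = sorted(enumerate(sim), key=itemgetter(1), reverse=True)
    for doc, score in ranking:
        print("[ Score = " + "%.3f" % round(score, 3) + " ] " + corpus[doc])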

And now we can launch any query we see fit against our newly created Information Retrieval engine.

In [19]:
launch_query(sample_corpus, "human interface systems")

[ Score = 0.547 ] System and human system engineering testing of EPS
[ Score = 0.486 ] The EPS user interface management system
[ Score = 0.475 ] Human machine interface for lab abc computer applications
[ Score = 0.173 ] A survey of user opinion of computer system response time
[ Score = 0.000 ] Relation of user perceived response time to error measurement
[ Score = 0.000 ] The generation of random binary unordered trees
[ Score = 0.000 ] The intersection graph of paths in trees
[ Score = 0.000 ] Graph minors IV Widths of trees and well quasi ordering
[ Score = 0.000 ] Graph minors A survey
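
The non-zero scores above can be checked by hand with a plain cosine computation. As far as we can tell from launch_query, the index is built from the BOW corpus, so the query is TF-IDF weighted while the documents are compared through their raw term counts. The sketch below follows that reading. The helper functions are hypothetical, the query weights reuse the log2(9 / df) values from the 'idfs' output, and the document counts are copied from the readable BOW output.

```python
import math

def unit(vec):
    """Scale a sparse vector (dict of word -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {word: w / norm for word, w in vec.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    u, v = unit(u), unit(v)
    return sum(w * v.get(word, 0.0) for word, w in u.items())

# "human interface systems" after preprocessing, weighted by IDF.
query = {
    "human": math.log2(9 / 2),
    "interfac": math.log2(9 / 2),
    "system": math.log2(9 / 3),
}

# Raw term counts of the four documents that share a word with the query.
docs = {
    "System and human system engineering testing of EPS":
        {"human": 1, "system": 2, "ep": 1, "engin": 1, "test": 1},
    "The EPS user interface management system":
        {"interfac": 1, "system": 1, "user": 1, "ep": 1, "manag": 1},
    "Human machine interface for lab abc computer applications":
        {"abc": 1, "applic": 1, "comput": 1, "human": 1,
         "interfac": 1, "lab": 1, "machin": 1},
    "A survey of user opinion of computer system response time":
        {"comput": 1, "opinion": 1, "respons": 1, "survey": 1,
         "system": 1, "time": 1, "user": 1},
}

scores = {title: round(cosine(query, counts), 3) for title, counts in docs.items()}
for title, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print("[ Score = %.3f ] %s" % (score, title))
```

Building the index from tfidf[loaded_corpus] instead would weight documents and query in the same way; the scores would then differ from the ones shown in this notebook.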
