# A vector space model implementation using NLTK (Natural Language ToolKit) and Gensim

We first install the NLTK toolkit.

In [1]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


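We also need to download the NLTK data bundle. The call below fetches every available package, which is a large download; only the English stopword list is used in what follows, so nltk.download('stopwords') should also be enough.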

In [2]:
import nltk

nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\enriq\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\abc.zip.
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\enriq\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\alpino.zip.
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     C:\Users\enriq\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\biocreative_ppi.zip.
[nltk_data]    | Downloading package brown to
[nltk_data]    |     C:\Users\enriq\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\brown.zip.
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     C:\Users\enriq\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\brown_tei.zip.
[nltk_data]    | Downloading package cess_cat to
[nltk_data]    |     C:\Users\enriq\AppData\Roaming\nltk_data...
[...]

[nltk_data]    |   Unzipping corpora\pros_cons.zip.
[nltk_data]    | Downloading package qc to
[nltk_data]    |     C:\Users\enriq\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\qc.zip.
[nltk_data]    | Downloading package reuters to
[nltk_data]    |     C:\Users\enriq\AppData\Roaming\nltk_data...
[nltk_data]    | Downloading package rte to
[nltk_data]    |     C:\Users\enriq\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\rte.zip.
[nltk_data]    | Downloading package semcor to
[nltk_data]    |     C:\Users\enriq\AppData\Roaming\nltk_data...
[nltk_data]    | Downloading package senseval to
[nltk_data]    |     C:\Users\enriq\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\senseval.zip.
[nltk_data]    | Downloading package sentiwordnet to
[nltk_data]    |     C:\Users\enriq\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\sentiwordnet.zip.
[nltk_data]    | Downloading package sentence_polarity to
[...]

[nltk_data]    |   Unzipping corpora\nonbreaking_prefixes.zip.
[nltk_data]    | Downloading package vader_lexicon to
[nltk_data]    |     C:\Users\enriq\AppData\Roaming\nltk_data...
[nltk_data]    | Downloading package porter_test to
[nltk_data]    |     C:\Users\enriq\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping stemmers\porter_test.zip.
[nltk_data]    | Downloading package wmt15_eval to
[nltk_data]    |     C:\Users\enriq\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping models\wmt15_eval.zip.
[nltk_data]    | Downloading package mwa_ppdb to
[nltk_data]    |     C:\Users\enriq\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping misc\mwa_ppdb.zip.
[nltk_data]    | 
[nltk_data]  Done downloading collection all


True

We now install the Gensim package.

In [3]:
pip install gensim

Collecting gensim
  Downloading gensim-3.8.3-cp37-cp37m-win_amd64.whl (24.2 MB)
Collecting smart-open>=1.8.1
  Downloading smart_open-3.0.0.tar.gz (113 kB)
Collecting Cython==0.29.14
  Downloading Cython-0.29.14-cp37-cp37m-win_amd64.whl (1.7 MB)
Building wheels for collected packages: smart-open
  Building wheel for smart-open (setup.py): started
  Building wheel for smart-open (setup.py): finished with status 'done'
  Created wheel for smart-open: filename=smart_open-3.0.0-py3-none-any.whl size=107102 sha256=8fccac4ee89de8b2353dae063f73cd18397ce3e00092da0561d1a3493eb7770b
  Stored in directory: c:\users\enriq\appdata\local\pip\cache\wheels\83\a6\12\bf3c1a667bde4251be5b7a3368b2d604c9af2105b5c1cb1870
Successfully built smart-open
Installing collected packages: smart-open, Cython, gensim
  Attempting uninstall: Cython
    Found existing installation: Cython 0.29.15
    Uninstalling Cython-0.29.15:
      Successfully uninstalled Cython-0.29.15
Successfully installed Cython-0.29.14 gensim-

All the required software is now installed. We now import the NLTK function that performs tokenization, treating punctuation signs as separate tokens.

In [4]:
from nltk.tokenize import wordpunct_tokenize
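To get a feel for what this tokenizer does without loading NLTK, the sketch below approximates it with the standard library. It assumes the pattern `\w+|[^\w\s]+`, which is how the NLTK documentation describes `WordPunctTokenizer`; the helper name `wordpunct_sketch` is ours and is not part of NLTK.

```python
import re

# Assumed pattern: a run of word characters, or a run of punctuation characters.
_WORDPUNCT = re.compile(r"\w+|[^\w\s]+")

def wordpunct_sketch(text):
    """Approximate nltk.tokenize.wordpunct_tokenize with a plain regex."""
    return _WORDPUNCT.findall(text)

print(wordpunct_sketch("Human-machine interface, for lab ABC."))
```

Punctuation marks come out as separate tokens, which is why the preprocessing step later discards tokens of 2 characters or fewer.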

Next, we import the NLTK stopwords corpus, which we will use to filter out English stopwords.

In [5]:
from nltk.corpus import stopwords

Now we import the class that implements Porter's stemming algorithm.

In [6]:
from nltk.stem import PorterStemmer

Now we provide a sample document collection, containing 9 (very brief) documents.

In [7]:
sample_corpus = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey"
]

The first step is to preprocess each document in the collection. We write a function that receives a textual document in string format and returns a list containing the STEMS of all tokens in the document that are longer than 2 characters and are NOT (English) stopwords.

In [8]:
def preprocess_document(doc):
    # Set of English stopwords (articles and other common words) to exclude
    stopset = set(stopwords.words('english'))
    # The stemmer strips suffixes so that only the word stem remains
    stemmer = PorterStemmer()
    # Split the document into tokens (words and punctuation marks)
    tokens = wordpunct_tokenize(doc)
    # Keep the lowercased tokens that are longer than 2 characters and are not stopwords
    clean = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2]
    # Reduce each remaining token to its stem
    final = [stemmer.stem(word) for word in clean]
    return final

Let us now, for instance, preprocess the second document in the collection (index 1).

In [9]:
print(preprocess_document(sample_corpus[1]))

['survey', 'user', 'opinion', 'comput', 'system', 'respons', 'time']
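The filtering part of this step can be tried in isolation. The following is a minimal sketch that skips stemming and uses a tiny hypothetical stopword set (`TOY_STOPWORDS`) instead of NLTK's English list; the function name `filter_tokens` and its parameters are illustrative only.

```python
import re

# Hypothetical stopword list for illustration; NLTK's English list is much longer.
TOY_STOPWORDS = {"a", "and", "for", "in", "of", "the", "to"}

def filter_tokens(doc, stopset=TOY_STOPWORDS, min_len=3):
    """Lowercase tokens, dropping stopwords and tokens shorter than min_len."""
    tokens = re.findall(r"\w+|[^\w\s]+", doc)
    return [t.lower() for t in tokens
            if t.lower() not in stopset and len(t) >= min_len]

print(filter_tokens("A survey of user opinion of computer system response time"))
```

Comparing this with the output above shows what the stemmer contributes: "computer" becomes "comput" and "response" becomes "respons".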


Once all documents in the collection have been preprocessed, we need to create a dictionary containing the mappings WORD_ID -> WORD. This dictionary is required to create the vector-based document representations.

In [10]:
from gensim import corpora

# Vocabulary: the set of distinct terms found in the collection
def create_dictionary(docs):
    # List of preprocessed documents (a list of lists of stems)
    pdocs = [preprocess_document(doc) for doc in docs]
    # Build the dictionary
    dictionary = corpora.Dictionary(pdocs)
    # Save it to a file
    dictionary.save('/tmp/vsm.dict')
    return dictionary

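Let us call the create_dictionary function, feeding it the complete corpus. Note that the function also saves the generated dictionary to disk. The result is stored in a variable named dict, which shadows Python's built-in dict type for the rest of the notebook.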

In [11]:
dict = create_dictionary(sample_corpus)
print(dict)

Dictionary(34 unique tokens: ['abc', 'applic', 'comput', 'human', 'interfac']...)
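Conceptually, the dictionary is just a mapping between stems and integer identifiers. The sketch below builds such a mapping in plain Python. It assumes that the new stems of each document receive identifiers in alphabetical order, which matches the identifiers seen in this notebook but is not guaranteed to match Gensim's internals; `build_token2id` is a hypothetical helper.

```python
def build_token2id(processed_docs):
    """Assign consecutive integer ids to stems, document by document."""
    token2id = {}
    for doc in processed_docs:
        # Assumption: unseen stems of a document are numbered in alphabetical order.
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

# Stems of the first two documents of the sample corpus
docs = [
    ["human", "machin", "interfac", "lab", "abc", "comput", "applic"],
    ["survey", "user", "opinion", "comput", "system", "respons", "time"],
]
token2id = build_token2id(docs)
id2token = {i: t for t, i in token2id.items()}  # the WORD_ID -> WORD direction
print(len(token2id), id2token[0], token2id["system"])
```

The stem "comput" occurs in both documents but is stored only once, so two documents of seven stems each yield 13 entries.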


We have now built the dictionary containing the vocabulary that we will use for indexing. Next, we write a function that creates the bag-of-words (BOW) representation of each document in the collection.

In [12]:
# Convert the corpus into bag-of-words vectors, which are used later for ranking
def docs2bows(corpus, dictionary):
    docs = [preprocess_document(d) for d in corpus]
    # Count the occurrences of each term in every document
    vectors = [dictionary.doc2bow(doc) for doc in docs]
    # Save the vectors to disk in Matrix Market format
    corpora.MmCorpus.serialize('/tmp/vsm_docs.mm', vectors)
    return vectors

Note that the function also saves the generated BOW-based corpus to disk, using the corpora module that we already imported from Gensim. Let us now generate the BOWs for the complete corpus.

In [13]:
# Each document becomes a vector of (term identifier, frequency) pairs
bows = docs2bows(sample_corpus, dict)
print(bows)

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)], [(2, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)], [(4, 1), (10, 1), (12, 1), (13, 1), (14, 1)], [(3, 1), (10, 2), (13, 1), (15, 1), (16, 1)], [(8, 1), (11, 1), (12, 1), (17, 1), (18, 1), (19, 1), (20, 1)], [(21, 1), (22, 1), (23, 1), (24, 1), (25, 1)], [(24, 1), (26, 1), (27, 1), (28, 1)], [(24, 1), (26, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1)], [(9, 1), (26, 1), (29, 1)]]


These are pairs (word identifier, frequency). Let us now convert them into something a bit more readable.

In [14]:
# Same vectors, but showing the term itself instead of its identifier
for v in bows:
    tvec = [(dict[id], freq) for (id, freq) in v]
    print(tvec)

[('abc', 1), ('applic', 1), ('comput', 1), ('human', 1), ('interfac', 1), ('lab', 1), ('machin', 1)]
[('comput', 1), ('opinion', 1), ('respons', 1), ('survey', 1), ('system', 1), ('time', 1), ('user', 1)]
[('interfac', 1), ('system', 1), ('user', 1), ('ep', 1), ('manag', 1)]
[('human', 1), ('system', 2), ('ep', 1), ('engin', 1), ('test', 1)]
[('respons', 1), ('time', 1), ('user', 1), ('error', 1), ('measur', 1), ('perceiv', 1), ('relat', 1)]
[('binari', 1), ('gener', 1), ('random', 1), ('tree', 1), ('unord', 1)]
[('tree', 1), ('graph', 1), ('intersect', 1), ('path', 1)]
[('tree', 1), ('graph', 1), ('minor', 1), ('order', 1), ('quasi', 1), ('well', 1), ('width', 1)]
[('survey', 1), ('graph', 1), ('minor', 1)]
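The conversion from a list of stems to (identifier, frequency) pairs can be sketched with `collections.Counter`. This is an illustration only, not Gensim's implementation; `bow_sketch` is a hypothetical helper, and the identifiers are copied from the outputs above for the fourth document.

```python
from collections import Counter

def bow_sketch(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens not in token2id."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Identifiers taken from the outputs above (fourth document of the corpus)
token2id = {"human": 3, "system": 10, "ep": 13, "engin": 15, "test": 16}
tokens = ["system", "human", "system", "engin", "test", "ep"]
print(bow_sketch(tokens, token2id))
```

The stem "system" appears twice in that document, which is why its pair carries a frequency of 2.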


These are basically TF-weighted vectors. We now want to convert these vectors into their TF-IDF weighted counterparts. To do so, we first need to import the models module from Gensim.

In [15]:
from gensim import models

def create_TF_IDF_model(corpus):
    # Build the dictionary and the bag-of-words vectors (also written to disk)
    dictionary = create_dictionary(corpus)
    docs2bows(corpus, dictionary)
    # Load the serialized bag-of-words corpus
    loaded_corpus = corpora.MmCorpus('/tmp/vsm_docs.mm')
    # Fit the TF-IDF model on the corpus
    tfidf = models.TfidfModel(loaded_corpus)
    return tfidf, dictionary

Let us now create the TF-IDF model.

In [16]:
tfidfm = create_TF_IDF_model(sample_corpus)
print(tfidfm)

(<gensim.models.tfidfmodel.TfidfModel object at 0x7f111aa2dd30>, <gensim.corpora.dictionary.Dictionary object at 0x7f114af33860>)


As can be seen, the function returns a tuple containing the TF-IDF model and the associated dictionary. Let us now take a closer look at the TF-IDF model.

In [17]:
print(tfidfm[0].__dict__)

{'id2word': None, 'wlocal': <function identity at 0x7f114bb09950>, 'wglobal': <function df2idf at 0x7f114af221e0>, 'normalize': True, 'num_docs': 9, 'num_nnz': 50, 'idfs': {0: 3.1699250014423126, 1: 3.1699250014423126, 2: 2.1699250014423126, 3: 2.1699250014423126, 4: 2.1699250014423126, 5: 3.1699250014423126, 6: 3.1699250014423126, 7: 3.1699250014423126, 8: 2.1699250014423126, 9: 2.1699250014423126, 10: 1.5849625007211563, 11: 2.1699250014423126, 12: 1.5849625007211563, 13: 2.1699250014423126, 14: 3.1699250014423126, 15: 3.1699250014423126, 16: 3.1699250014423126, 17: 3.1699250014423126, 18: 3.1699250014423126, 19: 3.1699250014423126, 20: 3.1699250014423126, 21: 3.1699250014423126, 22: 3.1699250014423126, 23: 3.1699250014423126, 24: 1.5849625007211563, 25: 3.1699250014423126, 26: 1.5849625007211563, 27: 3.1699250014423126, 28: 3.1699250014423126, 29: 2.1699250014423126, 30: 3.1699250014423126, 31: 3.1699250014423126, 32: 3.1699250014423126, 33: 3.1699250014423126}, 'smartirs': None, 's
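The three distinct values in the `idfs` entry can be reproduced by hand. The sketch below assumes the weighting idf = log2(N / df), with N = 9 documents, and unit-length normalization of the weighted vector (suggested by `'normalize': True`); `idf` and `tfidf_unit` are hypothetical helpers, not Gensim functions.

```python
import math

def idf(num_docs, doc_freq):
    """Base-2 inverse document frequency (assumed weighting)."""
    return math.log2(num_docs / doc_freq)

def tfidf_unit(bow, dfs, num_docs):
    """Weight (id, tf) pairs by idf and scale the vector to unit length."""
    weights = [(i, tf * idf(num_docs, dfs[i])) for i, tf in bow]
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(i, w / norm) for i, w in weights]

# Document frequencies of the terms in the fourth document:
# human (3) and ep (13) occur in 2 documents, system (10) in 3, engin (15) and test (16) in 1.
dfs = {3: 2, 10: 3, 13: 2, 15: 1, 16: 1}
bow = [(3, 1), (10, 2), (13, 1), (15, 1), (16, 1)]
print([idf(9, d) for d in (1, 2, 3)])
print(tfidf_unit(bow, dfs, 9))
```

A term found in a single document gets the highest weight (about 3.17), while a term found in three documents, such as "system", gets the lowest (about 1.58).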

We finally create a function that, given the corpus and a user-provided query, prints a document ranking sorted in descending order of relevance (according to the cosine measure).

In [18]:
from operator import itemgetter
from gensim import similarities

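def launch_query(corpus, q, filename='/tmp/vsm_docs.mm'):
    # Build the model; this also writes the bag-of-words corpus to the default path
    tfidf, dictionary = create_TF_IDF_model(corpus)
    loaded_corpus = corpora.MmCorpus(filename)
    # Index the stored bag-of-words vectors for cosine similarity queries.
    # Note: these are raw term counts; indexing tfidf[loaded_corpus] instead
    # would weight the documents too, and would change the scores.
    index = similarities.MatrixSimilarity(loaded_corpus, num_features=len(dictionary))
    # Preprocess the query and convert it to a TF-IDF weighted vector
    pq = preprocess_document(q)
    vq = dictionary.doc2bow(pq)
    qtfidf = tfidf[vq]
    # Cosine similarity between the query and every document
    sim = index[qtfidf]
    ranking = sorted(enumerate(sim), key=itemgetter(1), reverse=True)
    for doc, score in ranking:
        print("[ Score = " + "%.3f" % round(score, 3) + " ] " + corpus[doc])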

And now we can launch any query we like against our newly created Information Retrieval engine.

In [19]:
launch_query(sample_corpus, "human interface systems")

[ Score = 0.547 ] System and human system engineering testing of EPS
[ Score = 0.486 ] The EPS user interface management system
[ Score = 0.475 ] Human machine interface for lab abc computer applications
[ Score = 0.173 ] A survey of user opinion of computer system response time
[ Score = 0.000 ] Relation of user perceived response time to error measurement
[ Score = 0.000 ] The generation of random binary unordered trees
[ Score = 0.000 ] The intersection graph of paths in trees
[ Score = 0.000 ] Graph minors IV Widths of trees and well quasi ordering
[ Score = 0.000 ] Graph minors A survey
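The top score can be checked by hand with a small cosine function over sparse vectors. This sketch assumes that the query stems (human, interfac, system) are weighted by idf = log2(9 / df) and that the document is represented by its raw term counts, which is what the index above stores; `cosine` is a hypothetical helper, not part of Gensim.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Query stems weighted by idf: human and interfac occur in 2 documents, system in 3
query = {
    "human": math.log2(9 / 2),
    "interfac": math.log2(9 / 2),
    "system": math.log2(9 / 3),
}
# Raw term counts of "System and human system engineering testing of EPS"
doc = {"human": 1, "system": 2, "ep": 1, "engin": 1, "test": 1}
print(f"{cosine(query, doc):.3f}")
```

Under these assumptions the sketch prints 0.547, the score reported for the top-ranked document. Documents that share no stem with the query have a zero dot product, which explains the five scores of 0.000.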
