# Core concepts

The whole gensim package revolves around the concepts of corpus, vector and model.

### Corpus
    
A collection of digital documents. This collection is used to automatically infer the structure of the documents, their topics, etc. For this reason, the collection is also called a training corpus. The inferred latent structure can later be used to assign topics to new documents that did not appear in the training corpus. No human intervention (such as tagging the documents by hand, or creating other metadata) is required.

### Vector
    
In the Vector Space Model (VSM), each document is represented by an array of features. For example, a single feature may be thought of as a question-answer pair:

* How many times does the word splonge appear in the document? Zero.
* How many paragraphs does the document consist of? Two.
* How many fonts does the document use? Five.

The question is usually represented only by its integer id (such as 1, 2 and 3 here), so that the representation of this document becomes a series of pairs like (1, 0.0), (2, 2.0), (3, 5.0). If we know all the questions in advance, we may leave them implicit and simply write (0.0, 2.0, 5.0). This sequence of answers can be thought of as a vector (in this case a 3-dimensional vector). For practical purposes, only questions to which the answer is (or can be converted to) a single real number are allowed.
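As a small illustration of the two notations, the sketch below stores the three example questions under the ids 1 to 3 and derives the implicit vector from the explicit pairs. The names `questions`, `pairs` and `vector` are made up for this example and are not part of gensim.

```python
# Hypothetical question ids, matching the three example questions.
questions = {
    1: "How many times does the word splonge appear in the document?",
    2: "How many paragraphs does the document consist of?",
    3: "How many fonts does the document use?",
}

# Explicit representation: one (question id, answer) pair per question.
pairs = [(1, 0.0), (2, 2.0), (3, 5.0)]

# Implicit representation: answers only, ordered by question id.
answers = dict(pairs)
vector = tuple(answers[qid] for qid in sorted(questions))

print(vector)  # (0.0, 2.0, 5.0)
```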

The questions are the same for each document, so that looking at two vectors (representing two documents), we will hopefully be able to draw conclusions such as “The numbers in these two vectors are very similar, and therefore the original documents must be similar, too”. Of course, whether such conclusions correspond to reality depends on how well we picked our questions.

### Sparse Vector

Typically, the answer to most questions will be 0.0. To save space, we omit these zero answers from the document’s representation, and write only (2, 2.0), (3, 5.0) (note the missing (1, 0.0)). Since the set of all questions is known in advance, every feature missing from a sparse representation of a document can be unambiguously resolved to 0.0.
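The round trip between the two representations can be sketched in a few lines of plain Python. The helpers `to_sparse` and `to_dense` are hypothetical names for this illustration, and the ids start at 1 only to match the example in the text.

```python
def to_sparse(dense, first_id=1):
    """Keep only the non-zero answers as (question id, answer) pairs."""
    return [(first_id + i, value) for i, value in enumerate(dense) if value != 0.0]


def to_dense(sparse, num_questions, first_id=1):
    """Rebuild the full vector; every id missing from `sparse` becomes 0.0."""
    dense = [0.0] * num_questions
    for qid, value in sparse:
        dense[qid - first_id] = value
    return dense


sparse = to_sparse([0.0, 2.0, 5.0])
print(sparse)               # [(2, 2.0), (3, 5.0)]
print(to_dense(sparse, 3))  # [0.0, 2.0, 5.0]
```

Note that `to_dense` needs the total number of questions as an argument: the sparse form alone does not record how many trailing zeros were dropped.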

Gensim does not prescribe any specific corpus format; a corpus is anything that, when iterated over, successively yields these sparse vectors. For example, the list [[(2, 2.0), (3, 5.0)], [(0, 1.0), (3, 1.0)]] is a trivial corpus of two documents, each with two non-zero feature-answer pairs.
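Because only iteration is required, a corpus does not have to live in memory as a list. The sketch below is one possible shape for such an object: a hypothetical `LineCorpus` class that parses one document per line from an invented `id:value` text format and yields the sparse vectors one at a time. Neither the class nor the format comes from gensim.

```python
class LineCorpus:
    """Hypothetical corpus: one document per line, written as 'id:value' tokens."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        # Parse and yield one sparse vector at a time; nothing is cached.
        for line in self.lines:
            yield [(int(fid), float(val))
                   for fid, val in (token.split(":") for token in line.split())]


corpus = LineCorpus(["2:2.0 3:5.0", "0:1.0 3:1.0"])
for doc in corpus:
    print(doc)
```

Implementing `__iter__` as a generator method, rather than handing over a single generator object, lets the corpus be iterated more than once, which matters for anything that makes several passes over the data.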

### Model

We use "model" as an abstract term referring to a transformation from one document representation to another. In gensim, documents are represented as vectors, so a model can be thought of as a transformation between two vector spaces. The details of this transformation are learned from the training corpus.

For example, consider a transformation that takes a raw count of word occurrences and weights them so that common words are discounted and rare words are promoted. The exact amount that any particular word is weighted by is determined by the relative frequency of that word in the training corpus. When we apply this model we transform from one vector space (containing the raw word counts) to another (containing the weighted counts).
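To make the "learn from a training corpus, then apply" idea concrete, here is a toy model in plain Python. `ToyIdfModel` is a made-up class, not gensim's implementation; the weight `log2(N / df)` (number of training documents divided by the number of documents containing the feature) is just one simple choice that discounts common features.

```python
import math


class ToyIdfModel:
    """Toy model (not gensim's TfidfModel): learns one weight per feature id."""

    def __init__(self, training_corpus):
        docs = list(training_corpus)
        doc_freq = {}
        for doc in docs:
            for fid, _ in doc:
                doc_freq[fid] = doc_freq.get(fid, 0) + 1
        # Features seen in fewer training documents get a larger weight.
        self.weights = {fid: math.log2(len(docs) / df) for fid, df in doc_freq.items()}

    def __getitem__(self, doc):
        # Map raw counts to weighted counts; features unseen in training are dropped.
        return [(fid, count * self.weights[fid]) for fid, count in doc if fid in self.weights]


training = [[(0, 1), (1, 1)], [(0, 2)], [(0, 1), (2, 3)], [(0, 1), (1, 1)]]
model = ToyIdfModel(training)
print(model[[(0, 5), (1, 1), (2, 1)]])  # [(0, 0.0), (1, 1.0), (2, 2.0)]
```

Feature 0 occurs in every training document, so its weight is zero no matter how often it appears in the new document, while the rare feature 2 gets the largest weight.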

***

In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [4]:
from gensim import corpora, models, similarities

2018-01-28 19:27:56,941 : INFO : 'pattern' package not found; tag filters are not available for English


First, let’s import gensim and create a small corpus of nine documents and twelve features:

In [2]:
corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
          [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
          [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
          [(0, 1.0), (4, 2.0), (7, 1.0)],
          [(3, 1.0), (5, 1.0), (6, 1.0)],
          [(9, 1.0)],
          [(9, 1.0), (10, 1.0)],
          [(9, 1.0), (10, 1.0), (11, 1.0)],
          [(8, 1.0), (10, 1.0), (11, 1.0)]]

Next, let’s initialize a transformation:

In [5]:
tfidf = models.TfidfModel(corpus)

2018-01-28 19:28:02,605 : INFO : collecting document frequencies
2018-01-28 19:28:02,607 : INFO : PROGRESS: processing document #0
2018-01-28 19:28:02,610 : INFO : calculating IDF weights for 9 documents and 11 features (28 matrix non-zeros)


A transformation is used to convert documents from one vector representation into another:



In [6]:
vec = [(0, 1), (4, 1)]
print(tfidf[vec])

[(0, 0.8075244024440723), (4, 0.5898341626740045)]


Here, we used Tf-Idf, a simple transformation which takes documents represented as bag-of-words counts and applies a weighting which discounts common terms (or, equivalently, promotes rare terms). It also scales the resulting vector to unit length (in the Euclidean norm).
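The printed weights can be reproduced by hand. The sketch below assumes the weight of a feature is its count times `log2(N / df)`, followed by scaling to unit length; the text above does not spell out this formula, so treat it as an assumption that happens to match the printed numbers. The helper name `tfidf_unit` is hypothetical.

```python
import math

corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
          [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
          [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
          [(0, 1.0), (4, 2.0), (7, 1.0)],
          [(3, 1.0), (5, 1.0), (6, 1.0)],
          [(9, 1.0)],
          [(9, 1.0), (10, 1.0)],
          [(9, 1.0), (10, 1.0), (11, 1.0)],
          [(8, 1.0), (10, 1.0), (11, 1.0)]]

# Document frequency: in how many documents each feature id occurs.
doc_freq = {}
for doc in corpus:
    for fid, _ in doc:
        doc_freq[fid] = doc_freq.get(fid, 0) + 1


def tfidf_unit(doc):
    """Weight counts by log2(N / df) (assumed formula), then scale to unit length."""
    weighted = [(fid, count * math.log2(len(corpus) / doc_freq[fid])) for fid, count in doc]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(fid, w / norm) for fid, w in weighted]


result = tfidf_unit([(0, 1), (4, 1)])
print(result)  # approximately [(0, 0.8075), (4, 0.5898)]
```

Feature 0 occurs in two of the nine documents and feature 4 in three, so feature 0 is the rarer one and ends up with the larger weight.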

To transform the whole corpus via Tf-Idf and index it, in preparation for similarity queries:



In [7]:
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=12)

2018-01-28 19:28:55,079 : INFO : creating sparse index
2018-01-28 19:28:55,081 : INFO : creating sparse matrix from corpus
2018-01-28 19:28:55,083 : INFO : PROGRESS: at document #0
2018-01-28 19:28:55,089 : INFO : created <9x12 sparse matrix of type '<class 'numpy.float32'>'
	with 28 stored elements in Compressed Sparse Row format>


and to query the similarity of our query vector vec against every document in the corpus:

In [8]:
sims = index[tfidf[vec]]
print(list(enumerate(sims)))

[(0, 0.4662244), (1, 0.19139354), (2, 0.24600551), (3, 0.82094586), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]


How should this output be read? Document number zero (the first document) has a similarity score of 0.466 = 46.6%, the second document has a similarity score of 19.1%, and so on.

Thus, according to the Tf-Idf document representation and the cosine similarity measure, the document most similar to our query vec is document no. 3, with a similarity score of 82.1%. Note that in the Tf-Idf representation, documents that share no features with vec (documents no. 4–8) get a similarity score of 0.0.
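For readers who want to check these scores, the sketch below computes cosine similarity on sparse vectors in plain Python and ranks the printed results. The function `cosine` is a hypothetical helper, not the code gensim runs internally, and the weights for document no. 3 again assume the `log2(N / df)` weighting: features 0 and 7 each occur in two of the nine documents, feature 4 in three.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors given as (id, value) lists."""
    b_map = dict(b)
    dot = sum(value * b_map.get(fid, 0.0) for fid, value in a)
    norm_a = math.sqrt(sum(v * v for _, v in a))
    norm_b = math.sqrt(sum(v * v for _, v in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# The transformed query, copied from the printed output of tfidf[vec].
query = [(0, 0.8075244024440723), (4, 0.5898341626740045)]

# Document no. 3 under the assumed weighting; cosine ignores overall scale,
# so the weights do not need to be normalised first.
doc3 = [(0, math.log2(9 / 2)), (4, 2.0 * math.log2(9 / 3)), (7, math.log2(9 / 2))]
print(round(cosine(query, doc3), 3))  # 0.821

# A document sharing no feature id with the query scores exactly zero.
print(cosine(query, [(9, 1.0)]))  # 0.0

# Rank the documents by the similarity scores printed above, best match first.
sims = [0.4662244, 0.19139354, 0.24600551, 0.82094586, 0.0, 0.0, 0.0, 0.0, 0.0]
ranked = sorted(enumerate(sims), key=lambda item: -item[1])
print(ranked[0])  # (3, 0.82094586)
```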

Sources:
* https://radimrehurek.com/gensim/intro.html
* https://radimrehurek.com/gensim/tutorial.html