# Relevance-ranked search

This notebook is an expansion of https://github.com/mathiascreutz/nlp-tutorials/blob/main/tutorials/relevance-ranked-search.ipynb.

Let's return to the indexing of toy data, as we did in the tutorial on Boolean search. This new tutorial has also been inspired by course material by Filip Ginter in Turku.

Our documents now look slightly different:

In [11]:
documents = ["This is a silly silly silly example",
             "A better example",
             "Nothing to see here nor here nor here",
             "This is a great example and a long example too"]

## Bag of Words

We can index them as we did before:

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

cv = CountVectorizer(lowercase=True, binary=True)
binary_dense_matrix = cv.fit_transform(documents).T.todense()

print("Term-document matrix:\n")
print(binary_dense_matrix)

Term-document matrix:

[[0 0 0 1]
 [0 1 0 0]
 [1 1 0 1]
 [0 0 0 1]
 [0 0 1 0]
 [1 0 0 1]
 [0 0 0 1]
 [0 0 1 0]
 [0 0 1 0]
 [0 0 1 0]
 [1 0 0 0]
 [1 0 0 1]
 [0 0 1 0]
 [0 0 0 1]]


Next, we'll remove the `binary=True` optional argument from the `CountVectorizer` constructor. The default value is `binary=False`. What change can we observe?

In [13]:
cv = CountVectorizer(lowercase=True)
dense_matrix = cv.fit_transform(documents).T.todense()

print("Term-document matrix:\n")
print(dense_matrix)

Term-document matrix:

[[0 0 0 1]
 [0 1 0 0]
 [1 1 0 2]
 [0 0 0 1]
 [0 0 3 0]
 [1 0 0 1]
 [0 0 0 1]
 [0 0 2 0]
 [0 0 1 0]
 [0 0 1 0]
 [3 0 0 0]
 [1 0 0 1]
 [0 0 1 0]
 [0 0 0 1]]


Let's recall what term each row in the matrix corresponds to:

In [14]:
for (row, term) in enumerate(cv.get_feature_names_out()):
    print("Row", row, "is the vector for term:", term)

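Row 0 is the vector for term: and
Row 1 is the vector for term: better
Row 2 is the vector for term: example
Row 3 is the vector for term: great
Row 4 is the vector for term: here
Row 5 is the vector for term: is
Row 6 is the vector for term: long
Row 7 is the vector for term: nor
Row 8 is the vector for term: nothing
Row 9 is the vector for term: see
Row 10 is the vector for term: silly
Row 11 is the vector for term: this
Row 12 is the vector for term: to
Row 13 is the vector for term: too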

Now, if we run a query on the term "example", we get:

In [15]:
t2i = cv.vocabulary_  # shorter notation: t2i = term-to-index
print("Query: example")
print(dense_matrix[t2i["example"]])

Query: example
[[1 1 0 2]]


Instead of seeing *whether* a term occurs in a document, we now see *how many times* the term occurs in each document:

In [16]:
hits_list = np.array(dense_matrix[t2i["example"]])[0]

for i, nhits in enumerate(hits_list):
    print("Example occurs", nhits, "time(s) in document:", documents[i])

Example occurs 1 time(s) in document: This is a silly silly silly example
Example occurs 1 time(s) in document: A better example
Example occurs 0 time(s) in document: Nothing to see here nor here nor here
Example occurs 2 time(s) in document: This is a great example and a long example too


As the number and size of the documents grow, it is reasonable to assume that the more often a search term occurs in a document, the more relevant that document is. So, if we search for "example" in our toy document collection, the fourth document is the most relevant (2 hits), the first and second documents come next (1 hit each), and the third document is irrelevant (0 hits).

If we have multiple search terms, we might think that the more times the search terms occur in total in the document, the more relevant the document is.

Note that the bit-wise logical operators `AND (&)` and `OR (|)` will not work properly anymore when our matrix contains word counts. The same applies to `NOT (1 - x)`.
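To see why the bit-wise operators break down on counts, here is a minimal sketch using two rows copied from the count matrix above (the rows for "example" and "this"). The variable names are illustrative only. Converting the counts to Booleans first (`> 0`) restores the Boolean behaviour:

```python
import numpy as np

# Rows copied from the count matrix above (one entry per document)
example_counts = np.array([1, 1, 0, 2])   # row for "example"
this_counts = np.array([1, 0, 0, 1])      # row for "this"

# Bit-wise AND on raw counts: the fourth document contains both terms,
# but 2 & 1 == 0 (binary 10 & 01), so that hit is lost
naive_and = example_counts & this_counts
print("AND on counts:  ", naive_and)

# Turning the counts into Booleans first gives the expected Boolean AND
boolean_and = (example_counts > 0) & (this_counts > 0)
print("AND on Booleans:", boolean_and.astype(int))
```

With counts, only the first document survives the naive AND, whereas the Boolean version correctly reports the first and fourth documents.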

Let's search for the most relevant document for the query *better example*:

In [17]:
print("Query: better example")
print("Hits of better:        ", dense_matrix[t2i["better"]])
print("Hits of example:       ", dense_matrix[t2i["example"]])
print("Hits of better example:", dense_matrix[t2i["better"]] + dense_matrix[t2i["example"]])

Query: better example
Hits of better:         [[0 1 0 0]]
Hits of example:        [[1 1 0 2]]
Hits of better example: [[1 2 0 2]]


We just added the hits together. This means that we did not search for the phrase "better example", nor did we search for "better" AND "example". What we did search for was some kind of "better" OR "example", in which the sum of the number of occurrences of "better" and "example" in a document determines the relevance of the document.

This means that the second document, which contains one occurrence each of "better" and "example", is as good a hit as the fourth document, which contains two occurrences of "example" and no occurrence of "better".

Let's execute another query:

In [8]:
print("Query: silly example")
print("Hits of silly:        ", dense_matrix[t2i["silly"]])
print("Hits of example:      ", dense_matrix[t2i["example"]])
print("Hits of silly example:", dense_matrix[t2i["silly"]] + dense_matrix[t2i["example"]])

Query: silly example
Hits of silly:         [[3 0 0 0]]
Hits of example:       [[1 1 0 2]]
Hits of silly example: [[4 1 0 2]]


... and also rank (sort) the results by relevance. We leave out the document without a single hit:

In [9]:
# We need the np.array(...)[0] code here to convert the matrix to an ordinary list:
hits_list = np.array(dense_matrix[t2i["silly"]] + dense_matrix[t2i["example"]])[0]
print("Hits:", hits_list)

nhits_and_doc_ids = [ (nhits, i) for i, nhits in enumerate(hits_list) if nhits > 0 ]
print("List of tuples (nhits, doc_idx) where nhits > 0:", nhits_and_doc_ids)

ranked_nhits_and_doc_ids = sorted(nhits_and_doc_ids, reverse=True)
print("Ranked (nhits, doc_idx) tuples:", ranked_nhits_and_doc_ids)

print("\nMatched the following documents, ranked highest relevance first:")
for nhits, i in ranked_nhits_and_doc_ids:
    print("Score of 'silly example' is", nhits, "in document:", documents[i])

Hits: [4 1 0 2]
List of tuples (nhits, doc_idx) where nhits > 0: [(4, 0), (1, 1), (2, 3)]
Ranked (nhits, doc_idx) tuples: [(4, 0), (2, 3), (1, 1)]

Matched the following documents, ranked highest relevance first:
Score of 'silly example' is 4 in document: This is a silly silly silly example
Score of 'silly example' is 2 in document: This is a great example and a long example too
Score of 'silly example' is 1 in document: A better example
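The steps above can be wrapped into a small helper. This is only a sketch: the function name `rank_by_summed_counts` and the toy data are made up for illustration, and the function assumes a plain NumPy array (for example `np.asarray(dense_matrix)`) rather than the `np.matrix` object used above. Query terms that are missing from the vocabulary are simply skipped:

```python
import numpy as np

def rank_by_summed_counts(query, matrix, term_to_index):
    """Return (score, doc_idx) pairs, best first, for documents with score > 0."""
    scores = np.zeros(matrix.shape[1])
    for term in query.lower().split():
        row = term_to_index.get(term)      # None if the term is not in the vocabulary
        if row is not None:
            scores += matrix[row]          # add this term's counts for every document
    hits = [(score, i) for i, score in enumerate(scores) if score > 0]
    return sorted(hits, reverse=True)

# Toy data: three rows copied from the count matrix above
toy_t2i = {"better": 0, "example": 1, "silly": 2}
toy_counts = np.array([[0, 1, 0, 0],
                       [1, 1, 0, 2],
                       [3, 0, 0, 0]])

ranked = rank_by_summed_counts("silly example", toy_counts, toy_t2i)
for score, doc_idx in ranked:
    print("Document index", doc_idx, "has score", score)
```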


## Tf-idf

As we may guess, pure word counts are not a good indicator of relevance. Frequently occurring words are not usually very interesting from the point of view of information content.

One approach to weighting terms (words) by their relevance is to use *term frequency / inverse document frequency (tf-idf)* weighting. There is another [tutorial on tf-idf](https://github.com/mathiascreutz/nlp-tutorials/blob/main/tutorials/tf-idf-gutenberg.ipynb) that illustrates how this weighting works.

As a matter of fact, the scikit-learn library makes it easy for us to compute the tf-idf scores of terms in a document collection. Instead of the class `CountVectorizer` we can use `TfidfVectorizer`: 

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer

The TfidfVectorizer can be used with many different parameter values. One option is to count ordinary term frequencies. In this setup the resulting matrix should produce the same values as the one produced by the CountVectorizer:

In [19]:
# Parameters with which TfidfVectorizer does the same thing as CountVectorizer
tfv1 = TfidfVectorizer(lowercase=True, sublinear_tf=False, use_idf=False, norm=None)
tf_matrix1 = tfv1.fit_transform(documents).T.todense()

print("TfidfVectorizer:")
print(tf_matrix1)

print("\nCountVectorizer:")
print(dense_matrix)

TfidfVectorizer:
[[0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [1. 1. 0. 2.]
 [0. 0. 0. 1.]
 [0. 0. 3. 0.]
 [1. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 2. 0.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]
 [3. 0. 0. 0.]
 [1. 0. 0. 1.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]

CountVectorizer:
[[0 0 0 1]
 [0 1 0 0]
 [1 1 0 2]
 [0 0 0 1]
 [0 0 3 0]
 [1 0 0 1]
 [0 0 0 1]
 [0 0 2 0]
 [0 0 1 0]
 [0 0 1 0]
 [3 0 0 0]
 [1 0 0 1]
 [0 0 1 0]
 [0 0 0 1]]


The values are the same, except that the TfidfVectorizer produces floating-point values, whereas the CountVectorizer produces integer values.

Some useful parameters for the TfidfVectorizer are `sublinear_tf`, `use_idf` and `norm`.

`sublinear_tf=True` uses logarithmic word frequencies instead of linear ones. That is, if a term occurs 20 times, it is not 20 times more important than a term that occurs once:

In [20]:
tfv2 = TfidfVectorizer(lowercase=True, sublinear_tf=True, use_idf=False, norm=None)
tf_matrix2 = tfv2.fit_transform(documents).T.todense()

print("TfidfVectorizer (logarithmic term frequencies):")
print(tf_matrix2)

TfidfVectorizer (logarithmic term frequencies):
[[0.         0.         0.         1.        ]
 [0.         1.         0.         0.        ]
 [1.         1.         0.         1.69314718]
 [0.         0.         0.         1.        ]
 [0.         0.         2.09861229 0.        ]
 [1.         0.         0.         1.        ]
 [0.         0.         0.         1.        ]
 [0.         0.         1.69314718 0.        ]
 [0.         0.         1.         0.        ]
 [0.         0.         1.         0.        ]
 [2.09861229 0.         0.         0.        ]
 [1.         0.         0.         1.        ]
 [0.         0.         1.         0.        ]
 [0.         0.         0.         1.        ]]
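The non-integer values in this matrix can be reproduced by hand. As far as the scikit-learn documentation describes it, `sublinear_tf=True` replaces a raw count `tf` with `1 + ln(tf)`. The helper below is a hypothetical re-implementation for checking the numbers, not scikit-learn's own code:

```python
import math

def sublinear_tf(count):
    """Logarithmic term frequency: 1 + ln(count), or 0 for absent terms."""
    return 1.0 + math.log(count) if count > 0 else 0.0

for count in (0, 1, 2, 3):
    print("count =", count, "-> weight =", round(sublinear_tf(count), 8))
```

A count of 2 gives about 1.693 and a count of 3 gives about 2.099, which are the values seen for "example", "nor", "here" and "silly" in the matrix.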


`use_idf=True` factors in the inverse document frequencies. The more documents a term occurs in, the less relevant the term is, in general:

In [21]:
tfv3 = TfidfVectorizer(lowercase=True, sublinear_tf=True, use_idf=True, norm=None)
tf_matrix3 = tfv3.fit_transform(documents).T.todense()

print("TfidfVectorizer (logarithmic term frequencies and inverse document frequencies):")
print(tf_matrix3)

TfidfVectorizer (logarithmic term frequencies and inverse document frequencies):
[[0.         0.         0.         1.91629073]
 [0.         1.91629073 0.         0.        ]
 [1.22314355 1.22314355 0.         2.07096206]
 [0.         0.         0.         1.91629073]
 [0.         0.         4.02155128 0.        ]
 [1.51082562 0.         0.         1.51082562]
 [0.         0.         0.         1.91629073]
 [0.         0.         3.24456225 0.        ]
 [0.         0.         1.91629073 0.        ]
 [0.         0.         1.91629073 0.        ]
 [4.02155128 0.         0.         0.        ]
 [1.51082562 0.         0.         1.51082562]
 [0.         0.         1.91629073 0.        ]
 [0.         0.         0.         1.91629073]]
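These values can also be checked by hand. The sketch below assumes scikit-learn's default smoothing (`smooth_idf=True`), under which the idf of a term is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. The function names are illustrative only:

```python
import math

def sublinear_tf(count):
    """Logarithmic term frequency: 1 + ln(count), or 0 for absent terms."""
    return 1.0 + math.log(count) if count > 0 else 0.0

def smooth_idf(doc_freq, n_docs):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0

n_docs = 4
for doc_freq in (1, 2, 3):
    print("df =", doc_freq, "-> idf =", round(smooth_idf(doc_freq, n_docs), 8))

# "example" occurs twice in the fourth document and appears in 3 documents
weight_example = sublinear_tf(2) * smooth_idf(3, n_docs)
# "silly" occurs three times in the first document and appears in 1 document
weight_silly = sublinear_tf(3) * smooth_idf(1, n_docs)
print("example in document 4:", round(weight_example, 8))
print("silly in document 1:  ", round(weight_silly, 8))
```

A term that occurs in only one document gets an idf of about 1.916, whereas "example", which occurs in three of the four documents, gets only about 1.223.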


If, additionally, we use the L2 norm (`norm="l2"`), we normalize all document vectors (columns) to have a (Euclidean) length of one:

In [22]:
tfv4 = TfidfVectorizer(lowercase=True, sublinear_tf=True, use_idf=True, norm="l2")
tf_matrix4 = tfv4.fit_transform(documents).T.todense()

print("TfidfVectorizer (logarithmic term frequencies and inverse document frequencies, normalized document vectors):")
print(tf_matrix4)

TfidfVectorizer (logarithmic term frequencies and inverse document frequencies, normalized document vectors):
[[0.         0.         0.         0.39494151]
 [0.         0.84292635 0.         0.        ]
 [0.25939836 0.53802897 0.         0.42681878]
 [0.         0.         0.         0.39494151]
 [0.         0.         0.65482842 0.        ]
 [0.32040859 0.         0.         0.31137642]
 [0.         0.         0.         0.39494151]
 [0.         0.         0.52831145 0.        ]
 [0.         0.         0.31202925 0.        ]
 [0.         0.         0.31202925 0.        ]
 [0.85287113 0.         0.         0.        ]
 [0.32040859 0.         0.         0.31137642]
 [0.         0.         0.31202925 0.        ]
 [0.         0.         0.         0.39494151]]
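As a quick sanity check of what the L2 normalization does, the sketch below normalizes the second document vector ("A better example") by hand. The two non-zero weights are copied from the unnormalized tf-idf matrix above; the variable names are illustrative:

```python
import numpy as np

# Non-zero tf-idf weights of "A better example" before normalization:
# "better" (row 1) and "example" (row 2)
doc_weights = np.array([1.91629073, 1.22314355])

# Divide by the Euclidean length so that the vector has length one
length = np.linalg.norm(doc_weights)
normalized = doc_weights / length

print("Length before normalization:", length)
print("Normalized weights:", normalized)
print("Length after normalization:", np.linalg.norm(normalized))
```

The normalized weights come out at roughly 0.843 and 0.538, matching the second column of the matrix.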


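We can search the index in the same way as above, even if we use tf-idf weighting. The `t2i` mapping from the `CountVectorizer` can be reused, because both vectorizers build the same vocabulary from the same documents: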

In [23]:
print("Query: silly example")
print("Hits of silly:        ", tf_matrix4[t2i["silly"]])
print("Hits of example:      ", tf_matrix4[t2i["example"]])
print("Hits of silly example:", tf_matrix4[t2i["silly"]] + tf_matrix4[t2i["example"]])

Query: silly example
Hits of silly:         [[0.85287113 0.         0.         0.        ]]
Hits of example:       [[0.25939836 0.53802897 0.         0.42681878]]
Hits of silly example: [[1.11226949 0.53802897 0.         0.42681878]]


... and we can rank the documents using the tf-idf scores:

In [24]:
hits_list4 = np.array(tf_matrix4[t2i["silly"]] + tf_matrix4[t2i["example"]])[0]
print("Hits:", hits_list4)

hits_and_doc_ids = [ (hits, i) for i, hits in enumerate(hits_list4) if hits > 0 ]
print("List of tuples (hits, doc_idx) where hits > 0:", hits_and_doc_ids)

ranked_hits_and_doc_ids = sorted(hits_and_doc_ids, reverse=True)
print("Ranked (hits, doc_idx) tuples:", ranked_hits_and_doc_ids)

print("\nMatched the following documents, ranked highest relevance first:")
for hits, i in ranked_hits_and_doc_ids:
    print("Score of 'silly example' is {:.4f} in document: {:s}".format(hits, documents[i]))

Hits: [1.11226949 0.53802897 0.         0.42681878]
List of tuples (hits, doc_idx) where hits > 0: [(1.1122694945914164, 0), (0.5380289691033573, 1), (0.42681878177600086, 3)]
Ranked (hits, doc_idx) tuples: [(1.1122694945914164, 0), (0.5380289691033573, 1), (0.42681878177600086, 3)]

Matched the following documents, ranked highest relevance first:
Score of 'silly example' is 1.1123 in document: This is a silly silly silly example
Score of 'silly example' is 0.5380 in document: A better example
Score of 'silly example' is 0.4268 in document: This is a great example and a long example too


It makes sense that the document "This is a silly silly silly example" comes up on the top, but why does "A better example" now rank higher than "This is a great example and a long example too"? The former one contains only one occurrence of "example" whereas the latter one contains two. Can you figure out the reason?

### Cosine similarity

When we searched the index above, we scored the documents by summing together the tf-idf values of all the terms in the search query. A more sophisticated way is to transform the query itself into a document vector, in which we score each search term using tf-idf. We then compare the query vector to each document vector in the index. The more similar the query vector is to a document vector, the more relevant that document is for our search.

Let us first create a vector of our query:

In [25]:
query_vec4 = tfv4.transform(["silly example"]).todense()
print(query_vec4)

[[0.         0.         0.53802897 0.         0.         0.
  0.         0.         0.         0.         0.84292635 0.
  0.         0.        ]]


This is actually a matrix with one row (a document-term matrix). Since we have looked at term-document matrices above, let's transpose it to make the comparison easier:

In [26]:
print(query_vec4.T)

[[0.        ]
 [0.        ]
 [0.53802897]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.84292635]
 [0.        ]
 [0.        ]
 [0.        ]]


We can see that only two terms have non-zero values, and they are (not surprisingly) "example" and "silly":

In [27]:
print("Tf-idf weight of 'example' on row", t2i["example"], "is:", query_vec4.T[t2i["example"]])
print("Tf-idf weight of 'silly' on row", t2i["silly"], "is: ", query_vec4.T[t2i["silly"]])

Tf-idf weight of 'example' on row 2 is: [[0.53802897]]
Tf-idf weight of 'silly' on row 10 is:  [[0.84292635]]


Make sure that you understand why the score of "silly" is higher than that of "example".

To compare two vectors we use *cosine similarity*, which measures the cosine of the angle between the two vectors. If all vectors are guaranteed to be of length 1, which they are when we use the L2 norm, the cosine similarity reduces to the dot product:

In [28]:
for i in range(0, 4):
    
    # Go through each column (document vector) in the index 
    doc_vector = tf_matrix4[:, i]
    
    # Compute the dot product between the query vector and the document vector
    # (Some extra stuff here to extract the number from the matrix data structure)
    score = np.array(np.dot(query_vec4, doc_vector))[0][0]
    
    print("The score of 'silly example' is {:.4f} in document: {:s}".format(score, documents[i]))

The score of 'silly example' is 0.8585 in document: This is a silly silly silly example
The score of 'silly example' is 0.2895 in document: A better example
The score of 'silly example' is 0.0000 in document: Nothing to see here nor here nor here
The score of 'silly example' is 0.2296 in document: This is a great example and a long example too


Thanks to the beauty of matrix and vector algebra, we don't actually need a loop; we can do all the calculations in a single dot product:

In [29]:
scores = np.dot(query_vec4, tf_matrix4)
print("The documents have the following cosine similarities to the query:", scores)

The documents have the following cosine similarities to the query: [[0.85847138 0.28947517 0.         0.22964087]]


If we want to rank the matching documents, we can do it like this:

In [30]:
ranked_scores_and_doc_ids = \
    sorted([ (score, i) for i, score in enumerate(np.array(scores)[0]) if score > 0], reverse=True)

for score, i in ranked_scores_and_doc_ids:
    print("The score of 'silly example' is {:.4f} in document: {:s}".format(score, documents[i]))

The score of 'silly example' is 0.8585 in document: This is a silly silly silly example
The score of 'silly example' is 0.2895 in document: A better example
The score of 'silly example' is 0.2296 in document: This is a great example and a long example too
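The dot-product shortcut used above relies on all vectors having length one. For vectors that are not normalized, the dot product has to be divided by the product of the two lengths. A minimal sketch of the general case follows; the function name and the toy vectors are made up for illustration:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of any (non-zero) length."""
    u = np.asarray(u, dtype=float).ravel()
    v = np.asarray(v, dtype=float).ravel()
    denominator = np.linalg.norm(u) * np.linalg.norm(v)
    if denominator == 0.0:
        return 0.0                      # convention chosen here for all-zero vectors
    return float(np.dot(u, v) / denominator)

same_direction = cosine_similarity([3.0, 4.0], [6.0, 8.0])
orthogonal = cosine_similarity([3.0, 4.0], [-4.0, 3.0])
print("Same direction:", same_direction)
print("Orthogonal:    ", orthogonal)
```

Vectors pointing in the same direction get a similarity of 1 regardless of their lengths, and orthogonal vectors get 0.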


## Scaling up to larger document collections with sparse matrices

As we saw in the tutorial on Boolean search, any real-size data requires us to use sparse matrices. Let us go through how to use sparse matrices with tf-idf weighting.

First we index the data:

In [23]:
tfv5 = TfidfVectorizer(lowercase=True, sublinear_tf=True, use_idf=True, norm="l2")
sparse_matrix = tfv5.fit_transform(documents).T.tocsr() # CSR: compressed sparse row format => order by terms

print("Sparse term-document matrix with tf-idf weights:")
print(sparse_matrix)

Sparse term-document matrix with tf-idf weights:
  (0, 3)	0.39494150730720773
  (1, 1)	0.8429263481500496
  (2, 0)	0.25939836420616813
  (2, 1)	0.5380289691033573
  (2, 3)	0.42681878177600086
  (3, 3)	0.39494150730720773
  (4, 2)	0.6548284187983
  (5, 0)	0.3204085857171691
  (5, 3)	0.31137642070883736
  (6, 3)	0.39494150730720773
  (7, 2)	0.5283114451514632
  (8, 2)	0.3120292501545813
  (9, 2)	0.3120292501545813
  (10, 0)	0.8528711303852483
  (11, 0)	0.3204085857171691
  (11, 3)	0.31137642070883736
  (12, 2)	0.3120292501545813
  (13, 3)	0.39494150730720773


Then we convert the query string to a sparse vector:

In [24]:
# The query vector is a horizontal vector, so in order to sort by terms, we need to use CSC
query_vec5 = tfv5.transform(["silly example"]).tocsc() # CSC: compressed sparse column format

print("Sparse one-row query matrix (horizontal vector):")
print(query_vec5)

Sparse one-row query matrix (horizontal vector):
  (0, 2)	0.5380289691033573
  (0, 10)	0.8429263481500496


Next we compute the cosine similarity (dot product). Since we are dealing with sparse matrices, any zero values are automatically left out:

In [25]:
hits = np.dot(query_vec5, sparse_matrix)

print("Matching documents and their scores:")
print(hits)

Matching documents and their scores:
  (0, 0)	0.858471381859184
  (0, 1)	0.2894751715944214
  (0, 3)	0.22964086915289256


We can access the document indexes like this:

In [26]:
print("The matching documents are:", hits.nonzero()[1])

The matching documents are: [0 1 3]


We can access the tf-idf scores like this:

In [27]:
print("The scores of the documents are:", np.array(hits[hits.nonzero()])[0])

The scores of the documents are: [0.85847138 0.28947517 0.22964087]


We can rank the documents by scores. It may be hard to see that this works, since the documents happen to be in the right order already.

In [28]:
ranked_scores_and_doc_ids = sorted(zip(np.array(hits[hits.nonzero()])[0], hits.nonzero()[1]), reverse=True)

for score, i in ranked_scores_and_doc_ids:
    print("The score of 'silly example' is {:.4f} in document: {:s}".format(score, documents[i]))

The score of 'silly example' is 0.8585 in document: This is a silly silly silly example
The score of 'silly example' is 0.2895 in document: A better example
The score of 'silly example' is 0.2296 in document: This is a great example and a long example too


### Gutenberg corpus

Let's finally index the Gutenberg corpus in NLTK, to get a feel for some real data.

We start by loading the data:

In [29]:
import sys
!{sys.executable} -m pip install nltk

import nltk
nltk.download(['gutenberg'])

booknames = nltk.corpus.gutenberg.fileids()

bookdata = list(nltk.corpus.gutenberg.raw(name) for name in booknames)

print("There are", len(bookdata), "books in the collection:", booknames)

Collecting nltk
  Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
Collecting click
  Downloading click-8.1.3-py3-none-any.whl (96 kB)
Collecting regex>=2021.8.3
  Downloading regex-2022.10.31-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (770 kB)
Installing collected packages: regex, click, nltk
Successfully installed click-8.1.3 nltk-3.8.1 regex-2022.10.31


[nltk_data] Downloading package gutenberg to /home/jovyan/nltk_data...


There are 18 books in the collection: ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


[nltk_data]   Unzipping corpora/gutenberg.zip.


Then we index it using the TfidfVectorizer:

In [30]:
gv = TfidfVectorizer(lowercase=True, sublinear_tf=True, use_idf=True, norm="l2")
g_matrix = gv.fit_transform(bookdata).T.tocsr()

print("Number of terms in vocabulary:", len(gv.get_feature_names_out()))

Number of terms in vocabulary: 42063


Let's create a function for searching this document collection:

In [31]:
def search_gutenberg(query_string):

    # Vectorize query string
    query_vec = gv.transform([ query_string ]).tocsc()

    # Cosine similarity
    hits = np.dot(query_vec, g_matrix)

    # Rank hits
    ranked_scores_and_doc_ids = \
        sorted(zip(np.array(hits[hits.nonzero()])[0], hits.nonzero()[1]),
               reverse=True)
    
    # Output result
    print("Your query '{:s}' matches the following documents:".format(query_string))
    for i, (score, doc_idx) in enumerate(ranked_scores_and_doc_ids):
        print("Doc #{:d} (score: {:.4f}): {:s}".format(i, score, booknames[doc_idx]))
    print()

... and run some searches:

In [32]:
search_gutenberg("alice")
search_gutenberg("alice entertained harriet")
search_gutenberg("whale hunter")
search_gutenberg("oh thy lord cometh")
search_gutenberg("which book should i read")

Your query 'alice' matches the following documents:
Doc #0 (score: 0.1046): carroll-alice.txt
Doc #1 (score: 0.0106): edgeworth-parents.txt
Doc #2 (score: 0.0092): chesterton-thursday.txt

Your query 'alice entertained harriet' matches the following documents:
Doc #0 (score: 0.0590): carroll-alice.txt
Doc #1 (score: 0.0505): austen-emma.txt
Doc #2 (score: 0.0092): edgeworth-parents.txt
Doc #3 (score: 0.0052): chesterton-thursday.txt
Doc #4 (score: 0.0045): austen-persuasion.txt
Doc #5 (score: 0.0043): milton-paradise.txt
Doc #6 (score: 0.0040): austen-sense.txt
Doc #7 (score: 0.0039): chesterton-ball.txt
Doc #8 (score: 0.0010): bible-kjv.txt

Your query 'whale hunter' matches the following documents:
Doc #0 (score: 0.0281): melville-moby_dick.txt
Doc #1 (score: 0.0239): bryant-stories.txt
Doc #2 (score: 0.0135): whitman-leaves.txt
Doc #3 (score: 0.0112): chesterton-ball.txt
Doc #4 (score: 0.0109): edgeworth-parents.txt
Doc #5 (score: 0.0094): shakespeare-hamlet.txt
Doc #6 (score: 0.008

There are many different ways term-document scores can be computed. In some approaches the query vector is not calculated in the same way as the document vectors. For instance, the idf factor may be used for query vectors, but left out from the document vectors. If you are interested, you can compare some different approaches on your data.
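One such variant can be sketched with plain NumPy: the document vectors use only normalized term frequencies, while the query vector is additionally weighted by idf. Everything below is an illustrative assumption, including the two-term vocabulary (rows copied from the count matrix of the toy documents) and the smoothed idf formula; it is not a recommendation of a particular scheme:

```python
import numpy as np

# Toy term-document count matrix (terms x documents)
toy_counts = np.array([[1.0, 1.0, 0.0, 2.0],    # "example"
                       [3.0, 0.0, 0.0, 0.0]])   # "silly"
n_docs = toy_counts.shape[1]

# Document vectors: term frequencies only, L2-normalized, no idf
doc_norms = np.linalg.norm(toy_counts, axis=0)
doc_norms[doc_norms == 0] = 1.0           # avoid division by zero for empty columns
doc_vectors = toy_counts / doc_norms

# Query vector for "silly example": term frequency times smoothed idf, L2-normalized
doc_freq = (toy_counts > 0).sum(axis=1)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1.0
query_vector = np.array([1.0, 1.0]) * idf
query_vector /= np.linalg.norm(query_vector)

scores = query_vector @ doc_vectors
print("Scores per document:", scores)
```

In this setup the idf only influences how much each query term contributes, so the rarer term "silly" counts for more than "example" when the scores are added up.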

## Neural/Semantic Search

As we mentioned before, tf-idf is a straightforward and explainable way of getting vectors for each document and query. We can also do that with more sophisticated approaches, such as using a pre-trained model to obtain dense vectors.

We will use the model [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), a compute-efficient sentence encoder. It is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector which captures the semantic information. The sentence vector may be used for information retrieval, clustering or sentence similarity tasks.

In [1]:
# We install the required libraries
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-3.4.0-py3-none-any.whl.metadata (10 kB)
Collecting transformers<5.0.0,>=4.41.0 (from sentence-transformers)
  Downloading transformers-4.48.1-py3-none-any.whl.metadata (44 kB)
Collecting tqdm (from sentence-transformers)
  Downloading tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting torch>=1.11.0 (from sentence-transformers)
  Downloading torch-2.5.1-cp310-cp310-manylinux1_x86_64.whl.metadata (28 kB)
Collecting scikit-learn (from sentence-transformers)
  Downloading scikit_learn-1.6.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting scipy (from sentence-transformers)
  Downloading scipy-1.15.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting huggingface-hub>=0.20.0 (from sentence-transformers)
  Downloading huggingface_hub-0.28.0-py3-none-any.whl.metadata (13 kB)
Collecting Pillow (from sentence-transformers)
  Downloading pillow-11.1.

In [2]:
# We use a pretrained model from https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')  # Small but effective model



In [None]:
# We declare and encode our documents
documents = ["The Eiffel Tower is in Paris.",
            "Mount Everest is the highest mountain.",
            "Python is a popular programming language.",
            "Paris is the capital of France."]

doc_embeddings = model.encode(documents)
print(doc_embeddings.shape)
doc_embeddings[0]

(4, 384)


array([ 7.11015165e-02,  3.71286459e-02,  3.19642723e-02, -1.09879682e-02,
        2.63225827e-02, -1.67778153e-02, -7.40993172e-02,  1.36819319e-03,
        1.01486556e-02, -2.40272116e-02,  1.38697876e-02, -4.65226322e-02,
        2.55219303e-02, -1.01622239e-01,  1.59779098e-04, -6.43442497e-02,
        5.10249869e-04, -2.40349490e-02,  1.58969350e-02, -6.54078647e-02,
        5.26457392e-02, -1.10656351e-01,  2.65473500e-02,  1.28463060e-02,
       -9.26533118e-02, -8.47563520e-03, -7.23989159e-02,  2.67194901e-02,
       -1.18807601e-02, -4.85659353e-02,  6.14764802e-02, -1.46089867e-02,
       -8.23225379e-02,  6.90830573e-02, -1.60459038e-02,  3.92199233e-02,
        5.07549457e-02, -7.05964565e-02, -1.49870068e-02, -1.70641057e-02,
        1.61205996e-02, -2.17727330e-02, -2.79585179e-02,  1.10342484e-02,
       -3.07924841e-02, -2.76145842e-02, -1.82458125e-02,  7.33444048e-03,
        7.96843786e-03, -4.91103232e-02,  1.11571319e-01,  6.56414032e-02,
        1.18013667e-02, -

In [17]:
# We declare and encode our query
query = "Where is the Eiffel Tower?"
query_embedding = model.encode(query)
query_embedding

array([ 4.40369695e-02,  7.31283575e-02, -1.12389792e-02,  4.75877486e-02,
       -2.30635460e-02,  7.48442777e-04, -5.47064319e-02, -5.92756202e-04,
       -6.90620020e-03, -4.42262553e-02,  3.36680636e-02, -7.00417608e-02,
        4.30453382e-02, -9.29358676e-02,  9.47597530e-03, -2.49422397e-02,
       -3.48448334e-03, -2.46436000e-02,  2.02209074e-02, -9.00812745e-02,
        4.24851030e-02, -1.00372948e-01,  2.38506626e-02, -1.01565002e-02,
       -3.57761681e-02,  1.01311198e-02, -7.76870772e-02,  5.92276268e-02,
       -4.96612070e-03, -9.36788991e-02,  2.31905226e-02, -3.17058451e-02,
       -5.74658252e-02,  4.83381934e-02,  1.54677569e-03,  7.13536292e-02,
        6.79791942e-02, -4.21896838e-02,  3.38828005e-02, -2.25300696e-02,
       -1.14800623e-02,  4.04713955e-03,  8.37309007e-03,  1.69057958e-02,
       -3.76223251e-02, -2.61719935e-02, -2.80493032e-02, -1.40449535e-02,
        4.24632952e-02, -4.29379717e-02,  9.51183438e-02,  3.02840807e-02,
        2.70507876e-02, -

In [15]:
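# We apply cosine similarity. Note: a plain dot product equals the cosine
# similarity only if the embeddings have unit length.
import numpy as np

# Score every document against the query
cosine_similarities = np.dot(query_embedding, doc_embeddings.T)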

# Rank hits (higher is better)
ranked_doc_indices = np.argsort(cosine_similarities)[::-1]  # Sort descending

# Output results
print(f"Your query '{query}' matches the following documents:")
for i, doc_idx in enumerate(ranked_doc_indices):
    print(f"Doc #{i} (score: {cosine_similarities[doc_idx]:.4f}): {documents[doc_idx]}")

Your query 'Where is the Eiffel Tower?' matches the following documents:
Doc #0 (score: 0.8585): The Eiffel Tower is in Paris.
Doc #1 (score: 0.3081): Paris is the capital of France.
Doc #2 (score: 0.2097): Mount Everest is the highest mountain.
Doc #3 (score: -0.0083): Python is a popular programming language.
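If it is unclear whether an embedding model returns unit-length vectors, the vectors can be normalized explicitly before taking the dot product. The sketch below does this with plain NumPy on made-up two-dimensional vectors, so it runs without downloading a model; the function name `rank_by_cosine` is hypothetical:

```python
import numpy as np

def rank_by_cosine(query_vec, doc_matrix):
    """Return document indices (best first) and cosine similarities."""
    # Normalize the query and every document row to unit length
    unit_query = query_vec / np.linalg.norm(query_vec)
    unit_docs = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    similarities = unit_docs @ unit_query
    order = np.argsort(similarities)[::-1]    # sort descending
    return order, similarities

# Toy "embeddings": one row per document, deliberately not unit-length
toy_docs = np.array([[2.0, 0.0],
                     [1.0, 1.0],
                     [0.0, 3.0]])
toy_query = np.array([4.0, 0.0])

order, similarities = rank_by_cosine(toy_query, toy_docs)
for rank, doc_idx in enumerate(order):
    print(f"Doc #{rank} (score: {similarities[doc_idx]:.4f}): toy document {doc_idx}")
```

Because both sides are normalized first, the lengths of the raw vectors do not affect the ranking, only their directions do.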
