# ColBERT: Indexing & Search Notebook v0.1

We start by cloning the repository and installing its dependencies, then import the relevant classes. As we'll see below, `Indexer` and `Searcher` are the key actors here.

In [1]:
!git clone -b new_api https://github.com/stanford-futuredata/ColBERT.git

Cloning into 'ColBERT'...
remote: Enumerating objects: 516, done.
remote: Counting objects: 100% (356/356), done.
remote: Compressing objects: 100% (264/264), done.
remote: Total 516 (delta 169), reused 220 (delta 87), pack-reused 160
Receiving objects: 100% (516/516), 255.34 KiB | 3.99 MiB/s, done.
Resolving deltas: 100% (224/224), done.


In [2]:
%cd ColBERT

/content/ColBERT


In [None]:
!apt-get update
!apt-get install git

In [15]:
!pip install ujson
!pip install GitPython
!pip install transformers==3.0.2
!pip install faiss-gpu==1.6.3

Collecting faiss-gpu==1.6.3
  Downloading faiss_gpu-1.6.3-cp37-cp37m-manylinux2010_x86_64.whl (35.5 MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.6.3


In [16]:
import os
import sys
sys.path.insert(0, '../')

from colbert.infra import Run, RunConfig
from colbert.data import Queries, Collection
from colbert import Indexer, Searcher

The workflow here assumes an IR dataset: a set of queries and a corresponding collection of passages.

The classes `Queries` and `Collection` provide a convenient interface for working with such datasets.

We'll load the answer posts of the English Language Learners (ELL) StackExchange community as our collection, and use relevant GooAQ questions as our queries.
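
As a rough sketch of the kind of data these classes wrap, assume each file is a two-column TSV of `id<TAB>text`. That layout is an assumption; the notebook does not show the file contents. `load_tsv` below is a hypothetical stand-alone helper, not part of ColBERT, and the two sample rows are queries that appear later in this notebook.

```python
import csv
import io


def load_tsv(handle):
    """Read `id<TAB>text` rows into a dict keyed by integer id (hypothetical helper)."""
    items = {}
    # QUOTE_NONE keeps quotation marks inside the text untouched.
    for row in csv.reader(handle, delimiter='\t', quoting=csv.QUOTE_NONE):
        if not row:
            continue  # skip blank lines
        items[int(row[0])] = row[1]
    return items


# An in-memory stand-in for a small questions.tsv file (assumed format).
sample = io.StringIO(
    "0\tare disease names proper nouns?\n"
    "1\tare morphology and structure the same?\n"
)
toy_queries = load_tsv(sample)
print(f"Loaded {len(toy_queries)} queries")
```

`Queries` and `Collection` expose similar lookups (`len(...)` and indexing by id), which is all the rest of this notebook relies on.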

In [17]:
dataroot = '/future/u/okhattab/data/tmp/stackexchange/'
dataset = 'ell'

queries = os.path.join(dataroot, dataset, 'questions.tsv')
collection = os.path.join(dataroot, dataset, 'collection.answeronly.tsv')

queries = Queries(path=queries)
collection = Collection(path=collection)

f'Loaded {len(queries)} queries and {len(collection):,} passages'

[Dec 23, 02:52:41] #> Loading the queries from /future/u/okhattab/data/tmp/stackexchange/ell/questions.tsv ...


FileNotFoundError: ignored

This loaded 441 queries and 214,300 passages. The queries are numbered (in this dataset) from 0 through 440. Let's inspect one query and one passage.

In [None]:
print(queries[23])
print()
print(collection[79852])
print()

are disease names proper nouns?

No, “chance” and “get a chance” do not mean the same thing. “Get a chance” means “have an opportunity”, but the verb “chance” alone doesn't. The verb “chance” has several meanings, but none of them work here. In the sense of something happening by luck, it's normally used in a past tense (“I never chanced to meet him” = “I was not lucky enough to meet him”) or hypothetically (“If you chance to meet him, say hello from me” = “If you luck into meeting him, say hello from me”). It doesn't make sense here since by definition chance implies that



## Indexing

For efficient search, we can pre-compute the ColBERT representation of each passage and index them.

Below, the `Indexer` takes a model checkpoint and writes a (compressed) index to disk. We then prepare a `Searcher` for retrieval from this index.

(On the `future` machines with four Titan V GPUs, indexing should take 6-7 minutes. The output is fairly ugly at the moment!)
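
To see why pre-computing passage representations pays off, recall that ColBERT scores a passage by late interaction: each query token embedding is matched with its most similar passage token embedding, and those maxima are summed. The toy sketch below illustrates that scoring rule in plain Python with made-up 2-dimensional unit vectors. `maxsim_score` is a hypothetical name, not the library's API, and the real index stores 128-dimensional, compressed embeddings.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_embs, passage_embs):
    """Sum, over query tokens, of the best dot product with any passage token."""
    return sum(max(dot(q, d) for d in passage_embs) for q in query_embs)


# Toy embeddings (illustrative values only): two query tokens, two tokens per passage.
query_embs = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.6, 0.8]]    # points in similar directions to the query tokens
passage_b = [[0.0, -1.0], [-1.0, 0.0]]  # points away from the query tokens

score_a = maxsim_score(query_embs, passage_a)
score_b = maxsim_score(query_embs, passage_b)
print(score_a > score_b)
```

Because the passage embeddings do not depend on the query, they can be computed once at indexing time; only the query needs to be encoded at search time.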

In [None]:
with Run().context(RunConfig(nranks=4, experiment='notebook')):  # nranks specifies the number of GPUs to use.
    index_name = f'{dataset}.index'

    indexer = Indexer(checkpoint='/dfs/scratch0/okhattab/OpenQA/colbert-400000.dnn')  # MS MARCO ColBERT checkpoint
    indexer.index(name=index_name, collection=collection, overwrite=True)

    searcher = Searcher(index=index_name)

{
    "DocSettings": {
        "dim": 128,
        "doc_maxlen": 220,
        "mask_punctuation": true
    },
    "IndexingSettings": {
        "centroid_fraction_of_sample": 0.03,
        "chunksize": 2.0,
        "compression_level": 1,
        "compression_thresholds": "\/future\/u\/keshav2\/compression_thresholds.csv",
        "index_root": null,
        "kmeans_niters": 20,
        "kmeans_spherical": true,
        "partitions": null,
        "sample": 0.05
    },
    "QuerySettings": {
        "query_maxlen": 32
    },
    "ResourceSettings": {
        "checkpoint": "\/dfs\/scratch0\/okhattab\/OpenQA\/colbert-400000.dnn",
        "collection": "\/future\/u\/okhattab\/data\/tmp\/stackexchange\/ell\/collection.answeronly.tsv",
        "index_name": "ell.index",
        "queries": null,
        "triples": null
    },
    "RunSettings": {
        "amp": true,
        "experiment": "notebook",
        "gpus": [
            0,
            1,
            2,
            3
        ],
    

0it [00:00, ?it/s]

[Sep 21, 19:53:36] [0] 		 #> Encoding 25000 passages..
[Sep 21, 19:53:36] [2] 		 #> Encoding 25000 passages..
[Sep 21, 19:53:36] [1] 		 #> Encoding 25000 passages..
[Sep 21, 19:53:36] [3] 		 #> Encoding 25000 passages..
[Sep 21, 19:54:24] [0] 		 #> Saving chunk 0: 	 25,000 passages and 2,050,051 embeddings. From #0 onward.
[Sep 21, 19:54:32] [1] 		 #> Encoding 25000 passages..
[Sep 21, 19:54:32] [3] 		 #> Encoding 25000 passages..
[Sep 21, 19:54:33] [2] 		 #> Encoding 25000 passages..


1it [00:57, 57.25s/it]

[Sep 21, 19:54:33] [0] 		 #> Encoding 25000 passages..
[Sep 21, 19:55:21] [0] 		 #> Saving chunk 4: 	 25,000 passages and 1,922,803 embeddings. From #100,000 onward.


2it [01:52, 56.37s/it]

[Sep 21, 19:55:29] [0] 		 #> Encoding 14300 passages..
[Sep 21, 19:55:56] [0] 		 #> Saving chunk 8: 	 14,300 passages and 1,225,125 embeddings. From #200,000 onward.


3it [02:25, 48.64s/it]

[Sep 21, 19:56:10] [0] 		 #> Saving the indexing metadata to /future/u/okhattab/repos/ColBERT-private-releases/experiments/notebook/indexes/ell.index/metadata.json ..





#> Joined...
#> Joined...
#> Joined...
#> Joined...
[Sep 21, 19:56:13] #> Loading collection...
0M 
[Sep 21, 19:56:27] #> Building the emb2pid mapping..
[Sep 21, 19:56:28] len(self.emb2pid) = 16976942


## Search

Having built the index and prepared our `searcher`, we can search for individual query strings.

We can use the `queries` set we loaded earlier — or you can supply your own questions, say, "how large is the vocabulary of the average person?"

Feel free to get creative! But keep in mind this set of ~200k ELL passages can only answer a small, focused set of questions on English Language Learning.

In [None]:
query = queries[30]   # or supply your own query

print(f"#> {query}")

# Find the top-3 passages for this query
results = searcher.search(query, k=3)

# Print out the top-k retrieved passages
for passage_id, passage_rank, passage_score in zip(*results):
    print(f"\t [{passage_rank}] \t\t {passage_score:.1f} \t\t {searcher.collection[passage_id]}")

#> are morphology and structure the same?
	 [1] 		 21.3 		 This is tricky one, I'll try and explain my perspective as a native British English speaker. I have a biology background as opposed to physics, so that might come into play as well here. Some defintions: Structure: Arrangement of and relations between the parts or elements of something complex or a piece of construction. Whereas morphology: Morphology: A particular form, shape, or structure or the study of something's form of shape. As you can see, they both have one usage where they are near enough synonymous, and another where they are not. In most situations they can be
	 [2] 		 19.6 		 Since each program has one structure, the correct sentence is Program A and program B have the same structure: they both have a sequential structure. There is a single structure, which is shared by the two programs. The “same structure” in the first sentence is the sequential structure, there is only one. Other ways to formulate this idea inc
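
The loop above implies that `searcher.search` returns three parallel sequences (passage ids, ranks, scores), which `zip(*results)` regroups into one triple per passage. That layout is inferred from the loop, not taken from library documentation. A toy illustration, using the ids and scores that the batch search in this notebook reports for query 30:

```python
# Three parallel sequences, shaped like the assumed return value of searcher.search.
toy_results = (
    [54519, 9029, 177893],        # passage ids
    [1, 2, 3],                    # ranks
    [21.296875, 19.625, 19.4375]  # scores
)

# zip(*...) transposes the three sequences into (passage_id, rank, score) triples.
triples = list(zip(*toy_results))

for passage_id, passage_rank, passage_score in triples:
    print(f"[{passage_rank}] {passage_score:.1f} passage {passage_id}")
```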

## Batch Search

In many applications, you have a large batch of queries and you need to maximize the overall throughput. For that, you can use the `searcher.search_all(queries, k)` method, which returns a `Ranking` object that organizes the results across all queries.

(Batching provides many opportunities for higher-throughput search, though we have not implemented most of those optimizations for compressed indexes yet.)

In [None]:
rankings = searcher.search_all(queries, k=5).todict()

In [None]:
rankings[30]  # For query 30, a list of (passage_id, rank, score) for the top-k passages

[(54519, 1, 21.296875),
 (9029, 2, 19.625),
 (177893, 3, 19.4375),
 (156910, 4, 19.40625),
 (10821, 5, 19.34375)]
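
Assuming `todict()` yields a mapping from query id to a list of `(passage_id, rank, score)` tuples, as the output above suggests, post-processing the rankings is ordinary dictionary work. The sketch below copies the entry for query 30 into a toy dict; `top_passages` is a hypothetical helper, not a ColBERT method.

```python
# Toy stand-in for the dict returned by search_all(...).todict() (assumed shape).
toy_rankings = {
    30: [
        (54519, 1, 21.296875),
        (9029, 2, 19.625),
        (177893, 3, 19.4375),
        (156910, 4, 19.40625),
        (10821, 5, 19.34375),
    ],
}


def top_passages(rankings, k=1):
    """Keep only the ids of the k best-ranked passages for each query."""
    return {
        qid: [pid for pid, rank, _score in hits if rank <= k]
        for qid, hits in rankings.items()
    }


best = top_passages(toy_rankings, k=2)
print(best)
```

The same pattern extends to writing one `qid<TAB>pid<TAB>rank<TAB>score` line per hit if the rankings need to be saved for later evaluation.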