# ColBERTv2: Indexing & Search Notebook

We start by importing the relevant classes. As we'll see below, `Indexer` and `Searcher` are the key actors here. 

In [7]:
import os
import sys
sys.path.insert(0, '../')

from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert.data import Queries, Collection
from colbert import Indexer, Searcher

The workflow here assumes an IR dataset: a set of queries and a corresponding collection of passages.

The classes `Queries` and `Collection` provide a convenient interface for working with such datasets.
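
To make the expected input concrete, here is a minimal sketch of the file layout these classes read. It assumes each file is a TSV with one `id<TAB>text` row per line, which is how the LoTTE files used below are laid out. The helper `load_tsv` and the two toy strings are hypothetical stand-ins and are not part of ColBERT.

```python
def load_tsv(text):
    """Parse 'id<TAB>text' rows into a dict keyed by integer id."""
    rows = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        item_id, content = line.split("\t", 1)
        rows[int(item_id)] = content
    return rows

# Hypothetical two-row files, standing in for questions.search.tsv and collection.tsv
queries_tsv = "0\tare blossom end rot tomatoes edible?\n1\thow do I repot a cactus?\n"
collection_tsv = "0\tFirst passage text.\n1\tSecond passage text.\n"

toy_queries = load_tsv(queries_tsv)
toy_collection = load_tsv(collection_tsv)

print(f"Loaded {len(toy_queries)} queries and {len(toy_collection):,} passages")
print(toy_queries[0])
```

In the notebook itself, `Queries(path=...)` and `Collection(path=...)` take care of this parsing, so the sketch is only meant to show what the files contain.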

We will use the *dev set* of the **LoTTE benchmark** we recently introduced in the ColBERTv2 paper. The dev and test sets contain several domain-specific corpora, and we'll use the smallest dev set corpus, namely `lifestyle:dev`.

In [2]:
!mkdir -p downloads/

# ColBERTv2 checkpoint trained on MS MARCO Passage Ranking (388MB compressed)
!wget https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/colbertv2.0.tar.gz -P downloads/
!tar -xvzf downloads/colbertv2.0.tar.gz -C downloads/

# The LoTTE dev and test sets (3.4GB compressed)
!wget https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz -P downloads/
!tar -xvzf downloads/lotte.tar.gz -C downloads/

--2023-02-12 15:54:09--  https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/colbertv2.0.tar.gz
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 405924985 (387M) [application/octet-stream]
Saving to: ‘downloads/colbertv2.0.tar.gz’


2023-02-12 15:55:26 (5.12 MB/s) - ‘downloads/colbertv2.0.tar.gz’ saved [405924985/405924985]

colbertv2.0/
colbertv2.0/artifact.metadata
colbertv2.0/vocab.txt
colbertv2.0/tokenizer.json
colbertv2.0/special_tokens_map.json
colbertv2.0/tokenizer_config.json
colbertv2.0/config.json
colbertv2.0/pytorch_model.bin
--2023-02-12 15:55:28--  https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3576167599 (3

In [31]:
# The archives above were extracted into downloads/, relative to this notebook.
dataroot = 'downloads/lotte'
dataset = 'lifestyle'
datasplit = 'dev'

queries = os.path.join(dataroot, dataset, datasplit, 'questions.search.tsv')
collection = os.path.join(dataroot, dataset, datasplit, 'collection.tsv')

queries = Queries(path=queries)
collection = Collection(path=collection)

f'Loaded {len(queries)} queries and {len(collection):,} passages'

This loaded 417 queries and 269k passages. Let's inspect one query and one passage.

In [10]:
print(queries[24])
print()
print(collection[1])
print()

are blossom end rot tomatoes edible?

"Good Harbor. Anita Diamant's international bestseller ""The Red Tent"" brilliantly re-created the ancient world of womanhood. Diamant brings her remarkable storytelling skills to ""Good Harbor"" -- offering insight to the precarious balance of marriage and career, motherhood and friendship in the world of modern women. The seaside town of Gloucester, Massachusetts is a place where the smell of the ocean lingers in the air and the rocky coast glistens in the Atlantic sunshine. When longtime Gloucester-resident Kathleen Levine is diagnosed with breast cancer, her life is thrown into turmoil. Frightened and burdened by secrets, she meets Joyce Tabachnik -- a freelance writer with literary aspirations -- and a once-in-a-lifetime friendship is born. Joyce has just bought a small house in Gloucester, where she hopes to write as well as vacation with her family. Like Kathleen, Joyce is at a fragile place in her life. A mutual love for books, humor, and t

## Indexing

For efficient search, we can pre-compute the ColBERT representation of each passage and index them.

Below, the `Indexer` takes a model checkpoint and writes a (compressed) index to disk. We then prepare a `Searcher` for retrieval from this index.

(With four Titan V GPUs, indexing should take about 13 minutes. The output is fairly long/ugly at the moment!)

In [11]:
nbits = 2   # encode each dimension with 2 bits
doc_maxlen = 300   # truncate passages at 300 tokens

checkpoint = 'downloads/colbertv2.0'
index_name = f'{dataset}.{datasplit}.{nbits}bits'
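
To get a feel for what `nbits = 2` means for index size, here is a back-of-the-envelope sketch. It assumes 128-dimensional token embeddings and a 4-byte centroid ID stored next to each compressed residual, as described in the ColBERTv2 paper. Neither number is stated in this notebook, so treat both as assumptions. The embedding count is the one reported for chunk 0 in the indexing log further down.

```python
def bytes_per_embedding(dim, nbits, centroid_id_bytes=4):
    """Approximate compressed size of one token embedding, in bytes."""
    residual_bytes = dim * nbits // 8  # each dimension is stored in `nbits` bits
    return residual_bytes + centroid_id_bytes

DIM = 128  # assumption: embedding dimension, not stated in this notebook

per_embedding = bytes_per_embedding(DIM, nbits=2)
chunk_embeddings = 3_531_874  # embeddings reported for chunk 0 in the log
chunk_bytes = chunk_embeddings * per_embedding

print(f"{per_embedding} bytes per embedding")
print(f"about {chunk_bytes / 2**20:.0f} MiB for {chunk_embeddings:,} embeddings")
```

The estimate ignores the centroids themselves and other index metadata, so actual files on disk will be somewhat larger.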

In [12]:
with Run().context(RunConfig(nranks=1, experiment='notebook')):  # nranks specifies the number of GPUs to use.
    config = ColBERTConfig(doc_maxlen=doc_maxlen, nbits=nbits)

    indexer = Indexer(checkpoint=checkpoint, config=config)
    indexer.index(name=index_name, collection=collection, overwrite=True)



[Feb 23, 01:13:42] #> Note: Output directory /home/adqaicp/documents/ColBERT/docs/experiments/notebook/indexes/lifestyle.dev.2bits already exists


[Feb 23, 01:13:42] #> Will delete 50 files already at /home/adqaicp/documents/ColBERT/docs/experiments/notebook/indexes/lifestyle.dev.2bits in 20 seconds...
#> Starting...
nranks = 1 	 num_gpus = 1 	 device=0
{
    "query_token_id": "[unused0]",
    "doc_token_id": "[unused1]",
    "query_token": "[Q]",
    "doc_token": "[D]",
    "ncells": null,
    "centroid_score_threshold": null,
    "ndocs": null,
    "index_path": null,
    "nbits": 2,
    "kmeans_niters": 20,
    "resume": false,
    "similarity": "cosine",
    "bsize": 64,
    "accumsteps": 1,
    "lr": 1e-5,
    "maxsteps": 400000,
    "save_every": null,
    "warmup": 20000,
    "warmup_bert": null,
    "relu": false,
    "nway": 64,
    "use_ib_negatives": true,
    "reranker": false,
    "distillation_alpha": 1.0,
    "ignore_scores": false,
    "model_name": "bert-base-uncase

0it [00:00, ?it/s]

[Feb 23, 02:32:33] [0] 		 #> Encoding 25000 passages..
[Feb 23, 02:33:23] [0] 		 #> Saving chunk 0: 	 25,000 passages and 3,531,874 embeddings. From #0 onward.


1it [00:57, 57.02s/it]

[Feb 23, 02:33:30] [0] 		 #> Encoding 25000 passages..
[Feb 23, 02:34:20] [0] 		 #> Saving chunk 1: 	 25,000 passages and 3,511,047 embeddings. From #25,000 onward.


2it [01:53, 56.88s/it]

[Feb 23, 02:34:27] [0] 		 #> Encoding 25000 passages..
[Feb 23, 02:35:17] [0] 		 #> Saving chunk 2: 	 25,000 passages and 3,519,290 embeddings. From #50,000 onward.


3it [02:50, 56.95s/it]

[Feb 23, 02:35:24] [0] 		 #> Encoding 25000 passages..
[Feb 23, 02:36:14] [0] 		 #> Saving chunk 3: 	 25,000 passages and 3,502,249 embeddings. From #75,000 onward.


4it [03:47, 56.90s/it]

[Feb 23, 02:36:21] [0] 		 #> Encoding 25000 passages..
[Feb 23, 02:37:11] [0] 		 #> Saving chunk 4: 	 25,000 passages and 3,488,977 embeddings. From #100,000 onward.


5it [04:44, 56.87s/it]

[Feb 23, 02:37:18] [0] 		 #> Encoding 25000 passages..
[Feb 23, 02:38:08] [0] 		 #> Saving chunk 5: 	 25,000 passages and 3,520,990 embeddings. From #125,000 onward.


6it [05:41, 56.87s/it]

[Feb 23, 02:38:15] [0] 		 #> Encoding 25000 passages..
[Feb 23, 02:39:05] [0] 		 #> Saving chunk 6: 	 25,000 passages and 3,519,026 embeddings. From #150,000 onward.


7it [06:38, 56.91s/it]

[Feb 23, 02:39:12] [0] 		 #> Encoding 25000 passages..
[Feb 23, 02:40:02] [0] 		 #> Saving chunk 7: 	 25,000 passages and 3,489,306 embeddings. From #175,000 onward.


8it [07:35, 56.93s/it]

[Feb 23, 02:40:09] [0] 		 #> Encoding 25000 passages..
[Feb 23, 02:40:59] [0] 		 #> Saving chunk 8: 	 25,000 passages and 3,526,326 embeddings. From #200,000 onward.


9it [08:32, 56.93s/it]

[Feb 23, 02:41:06] [0] 		 #> Encoding 25000 passages..
[Feb 23, 02:41:56] [0] 		 #> Saving chunk 9: 	 25,000 passages and 3,511,564 embeddings. From #225,000 onward.


10it [09:29, 56.91s/it]

[Feb 23, 02:42:02] [0] 		 #> Encoding 25000 passages..
[Feb 23, 02:42:53] [0] 		 #> Saving chunk 10: 	 25,000 passages and 3,520,353 embeddings. From #250,000 onward.


11it [10:25, 56.90s/it]

[Feb 23, 02:42:59] [0] 		 #> Encoding 25000 passages..
[Feb 23, 02:43:50] [0] 		 #> Saving chunk 11: 	 25,000 passages and 3,496,385 embeddings. From #275,000 onward.


12it [11:23, 56.95s/it]

[Feb 23, 02:43:56] [0] 		 #> Encoding 25000 passages..
[Feb 23, 02:44:47] [0] 		 #> Saving chunk 12: 	 25,000 passages and 3,542,408 embeddings. From #300,000 onward.


13it [12:20, 57.07s/it]

[Feb 23, 02:44:54] [0] 		 #> Encoding 25000 passages..
[Feb 23, 02:45:44] [0] 		 #> Saving chunk 13: 	 25,000 passages and 3,512,796 embeddings. From #325,000 onward.


14it [13:17, 57.12s/it]

[Feb 23, 02:45:51] [0] 		 #> Encoding 25000 passages..
[Feb 23, 02:46:42] [0] 		 #> Saving chunk 14: 	 25,000 passages and 3,530,088 embeddings. From #350,000 onward.


15it [14:15, 57.23s/it]

[Feb 23, 02:46:49] [0] 		 #> Encoding 25000 passages..
[Feb 23, 02:47:39] [0] 		 #> Saving chunk 15: 	 25,000 passages and 3,480,406 embeddings. From #375,000 onward.


16it [15:12, 57.33s/it]

[Feb 23, 02:47:46] [0] 		 #> Encoding 25000 passages..
[Feb 23, 02:48:36] [0] 		 #> Saving chunk 16: 	 25,000 passages and 3,492,325 embeddings. From #400,000 onward.


17it [16:09, 57.16s/it]

[Feb 23, 02:48:43] [0] 		 #> Encoding 25000 passages..
[Feb 23, 02:49:33] [0] 		 #> Saving chunk 17: 	 25,000 passages and 3,490,779 embeddings. From #425,000 onward.


18it [17:06, 57.22s/it]

[Feb 23, 02:49:40] [0] 		 #> Encoding 25000 passages..
[Feb 23, 02:50:30] [0] 		 #> Saving chunk 18: 	 25,000 passages and 3,527,422 embeddings. From #450,000 onward.


19it [18:03, 57.14s/it]

[Feb 23, 02:50:37] [0] 		 #> Encoding 25000 passages..
[Feb 23, 02:51:27] [0] 		 #> Saving chunk 19: 	 25,000 passages and 3,496,300 embeddings. From #475,000 onward.


20it [19:00, 57.09s/it]

[Feb 23, 02:51:34] [0] 		 #> Encoding 25000 passages..
[Feb 23, 02:52:24] [0] 		 #> Saving chunk 20: 	 25,000 passages and 3,517,603 embeddings. From #500,000 onward.


21it [19:57, 57.02s/it]

[Feb 23, 02:52:31] [0] 		 #> Encoding 25000 passages..
[Feb 23, 02:53:21] [0] 		 #> Saving chunk 21: 	 25,000 passages and 3,505,067 embeddings. From #525,000 onward.


22it [20:54, 56.97s/it]

[Feb 23, 02:53:28] [0] 		 #> Encoding 25000 passages..
[Feb 23, 02:54:18] [0] 		 #> Saving chunk 22: 	 25,000 passages and 3,484,407 embeddings. From #550,000 onward.


23it [21:51, 56.90s/it]

[Feb 23, 02:54:25] [0] 		 #> Encoding 25000 passages..
[Feb 23, 02:55:15] [0] 		 #> Saving chunk 23: 	 25,000 passages and 3,520,017 embeddings. From #575,000 onward.


24it [22:48, 56.91s/it]

[Feb 23, 02:55:22] [0] 		 #> Encoding 25000 passages..
[Feb 23, 02:56:12] [0] 		 #> Saving chunk 24: 	 25,000 passages and 3,495,823 embeddings. From #600,000 onward.


25it [23:45, 57.05s/it]

[Feb 23, 02:56:19] [0] 		 #> Encoding 25000 passages..
[Feb 23, 02:57:09] [0] 		 #> Saving chunk 25: 	 25,000 passages and 3,485,067 embeddings. From #625,000 onward.


26it [24:42, 57.14s/it]

[Feb 23, 02:57:16] [0] 		 #> Encoding 25000 passages..
[Feb 23, 02:58:07] [0] 		 #> Saving chunk 26: 	 25,000 passages and 3,491,009 embeddings. From #650,000 onward.


27it [25:40, 57.25s/it]

[Feb 23, 02:58:14] [0] 		 #> Encoding 25000 passages..
[Feb 23, 02:59:04] [0] 		 #> Saving chunk 27: 	 25,000 passages and 3,486,524 embeddings. From #675,000 onward.


28it [26:37, 57.12s/it]

[Feb 23, 02:59:11] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:00:02] [0] 		 #> Saving chunk 28: 	 25,000 passages and 3,502,630 embeddings. From #700,000 onward.


29it [27:35, 57.37s/it]

[Feb 23, 03:00:09] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:01:00] [0] 		 #> Saving chunk 29: 	 25,000 passages and 3,478,553 embeddings. From #725,000 onward.


30it [28:33, 57.65s/it]

[Feb 23, 03:01:07] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:01:58] [0] 		 #> Saving chunk 30: 	 25,000 passages and 3,501,648 embeddings. From #750,000 onward.


31it [29:31, 57.66s/it]

[Feb 23, 03:02:05] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:02:56] [0] 		 #> Saving chunk 31: 	 25,000 passages and 3,513,308 embeddings. From #775,000 onward.


32it [30:29, 57.79s/it]

[Feb 23, 03:03:03] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:03:55] [0] 		 #> Saving chunk 32: 	 25,000 passages and 3,472,200 embeddings. From #800,000 onward.


33it [31:28, 58.12s/it]

[Feb 23, 03:04:01] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:04:53] [0] 		 #> Saving chunk 33: 	 25,000 passages and 3,524,446 embeddings. From #825,000 onward.


34it [32:26, 58.24s/it]

[Feb 23, 03:05:00] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:05:50] [0] 		 #> Saving chunk 34: 	 25,000 passages and 3,519,765 embeddings. From #850,000 onward.


35it [33:23, 57.92s/it]

[Feb 23, 03:05:57] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:06:49] [0] 		 #> Saving chunk 35: 	 25,000 passages and 3,526,466 embeddings. From #875,000 onward.


36it [34:22, 58.26s/it]

[Feb 23, 03:06:56] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:07:47] [0] 		 #> Saving chunk 36: 	 25,000 passages and 3,496,219 embeddings. From #900,000 onward.


37it [35:20, 58.21s/it]

[Feb 23, 03:07:54] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:08:45] [0] 		 #> Saving chunk 37: 	 25,000 passages and 3,485,037 embeddings. From #925,000 onward.


38it [36:17, 57.84s/it]

[Feb 23, 03:08:51] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:09:41] [0] 		 #> Saving chunk 38: 	 25,000 passages and 3,524,718 embeddings. From #950,000 onward.


39it [37:14, 57.58s/it]

[Feb 23, 03:09:48] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:10:39] [0] 		 #> Saving chunk 39: 	 25,000 passages and 3,481,003 embeddings. From #975,000 onward.


40it [38:12, 57.56s/it]

[Feb 23, 03:10:46] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:11:37] [0] 		 #> Saving chunk 40: 	 25,000 passages and 3,505,696 embeddings. From #1,000,000 onward.


41it [39:10, 57.77s/it]

[Feb 23, 03:11:44] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:12:34] [0] 		 #> Saving chunk 41: 	 25,000 passages and 3,519,486 embeddings. From #1,025,000 onward.


42it [40:07, 57.51s/it]

[Feb 23, 03:12:41] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:13:31] [0] 		 #> Saving chunk 42: 	 25,000 passages and 3,512,006 embeddings. From #1,050,000 onward.


43it [41:04, 57.32s/it]

[Feb 23, 03:13:38] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:14:28] [0] 		 #> Saving chunk 43: 	 25,000 passages and 3,515,841 embeddings. From #1,075,000 onward.


44it [42:01, 57.16s/it]

[Feb 23, 03:14:35] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:15:26] [0] 		 #> Saving chunk 44: 	 25,000 passages and 3,492,342 embeddings. From #1,100,000 onward.


45it [42:59, 57.62s/it]

[Feb 23, 03:15:33] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:16:25] [0] 		 #> Saving chunk 45: 	 25,000 passages and 3,517,306 embeddings. From #1,125,000 onward.


46it [43:59, 58.08s/it]

[Feb 23, 03:16:32] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:17:25] [0] 		 #> Saving chunk 46: 	 25,000 passages and 3,521,635 embeddings. From #1,150,000 onward.


47it [44:58, 58.51s/it]

[Feb 23, 03:17:32] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:18:24] [0] 		 #> Saving chunk 47: 	 25,000 passages and 3,499,550 embeddings. From #1,175,000 onward.


48it [45:57, 58.65s/it]

[Feb 23, 03:18:31] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:19:23] [0] 		 #> Saving chunk 48: 	 25,000 passages and 3,513,235 embeddings. From #1,200,000 onward.


49it [46:56, 58.80s/it]

[Feb 23, 03:19:30] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:20:22] [0] 		 #> Saving chunk 49: 	 25,000 passages and 3,498,891 embeddings. From #1,225,000 onward.


50it [47:55, 58.70s/it]

[Feb 23, 03:20:29] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:21:20] [0] 		 #> Saving chunk 50: 	 25,000 passages and 3,513,046 embeddings. From #1,250,000 onward.


51it [48:53, 58.73s/it]

[Feb 23, 03:21:27] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:22:20] [0] 		 #> Saving chunk 51: 	 25,000 passages and 3,495,427 embeddings. From #1,275,000 onward.


52it [49:54, 59.21s/it]

[Feb 23, 03:22:28] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:23:20] [0] 		 #> Saving chunk 52: 	 25,000 passages and 3,494,335 embeddings. From #1,300,000 onward.


53it [50:53, 59.22s/it]

[Feb 23, 03:23:27] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:24:19] [0] 		 #> Saving chunk 53: 	 25,000 passages and 3,526,543 embeddings. From #1,325,000 onward.


54it [51:52, 59.12s/it]

[Feb 23, 03:24:26] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:25:17] [0] 		 #> Saving chunk 54: 	 25,000 passages and 3,509,530 embeddings. From #1,350,000 onward.


55it [52:50, 58.82s/it]

[Feb 23, 03:25:24] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:26:14] [0] 		 #> Saving chunk 55: 	 25,000 passages and 3,511,963 embeddings. From #1,375,000 onward.


56it [53:47, 58.37s/it]

[Feb 23, 03:26:21] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:27:11] [0] 		 #> Saving chunk 56: 	 25,000 passages and 3,518,692 embeddings. From #1,400,000 onward.


57it [54:44, 57.93s/it]

[Feb 23, 03:27:18] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:28:08] [0] 		 #> Saving chunk 57: 	 25,000 passages and 3,534,418 embeddings. From #1,425,000 onward.


58it [55:41, 57.68s/it]

[Feb 23, 03:28:15] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:29:06] [0] 		 #> Saving chunk 58: 	 25,000 passages and 3,512,066 embeddings. From #1,450,000 onward.


59it [56:39, 57.63s/it]

[Feb 23, 03:29:13] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:30:05] [0] 		 #> Saving chunk 59: 	 25,000 passages and 3,486,280 embeddings. From #1,475,000 onward.


60it [57:38, 58.06s/it]

[Feb 23, 03:30:12] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:31:03] [0] 		 #> Saving chunk 60: 	 25,000 passages and 3,506,326 embeddings. From #1,500,000 onward.


61it [58:36, 58.20s/it]

[Feb 23, 03:31:10] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:32:02] [0] 		 #> Saving chunk 61: 	 25,000 passages and 3,516,145 embeddings. From #1,525,000 onward.


62it [59:35, 58.31s/it]

[Feb 23, 03:32:09] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:33:01] [0] 		 #> Saving chunk 62: 	 25,000 passages and 3,510,765 embeddings. From #1,550,000 onward.


63it [1:00:34, 58.44s/it]

[Feb 23, 03:33:08] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:33:59] [0] 		 #> Saving chunk 63: 	 25,000 passages and 3,482,178 embeddings. From #1,575,000 onward.


64it [1:01:31, 58.21s/it]

[Feb 23, 03:34:05] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:34:56] [0] 		 #> Saving chunk 64: 	 25,000 passages and 3,510,138 embeddings. From #1,600,000 onward.


65it [1:02:29, 58.07s/it]

[Feb 23, 03:35:03] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:35:54] [0] 		 #> Saving chunk 65: 	 25,000 passages and 3,511,113 embeddings. From #1,625,000 onward.


66it [1:03:27, 57.94s/it]

[Feb 23, 03:36:01] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:36:52] [0] 		 #> Saving chunk 66: 	 25,000 passages and 3,545,687 embeddings. From #1,650,000 onward.


67it [1:04:25, 57.95s/it]

[Feb 23, 03:36:59] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:37:50] [0] 		 #> Saving chunk 67: 	 25,000 passages and 3,489,005 embeddings. From #1,675,000 onward.


68it [1:05:23, 58.00s/it]

[Feb 23, 03:37:57] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:38:47] [0] 		 #> Saving chunk 68: 	 25,000 passages and 3,492,689 embeddings. From #1,700,000 onward.


69it [1:06:20, 57.67s/it]

[Feb 23, 03:38:54] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:39:44] [0] 		 #> Saving chunk 69: 	 25,000 passages and 3,502,892 embeddings. From #1,725,000 onward.


70it [1:07:17, 57.49s/it]

[Feb 23, 03:39:51] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:40:41] [0] 		 #> Saving chunk 70: 	 25,000 passages and 3,491,321 embeddings. From #1,750,000 onward.


71it [1:08:14, 57.29s/it]

[Feb 23, 03:40:48] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:41:38] [0] 		 #> Saving chunk 71: 	 25,000 passages and 3,513,498 embeddings. From #1,775,000 onward.


72it [1:09:11, 57.17s/it]

[Feb 23, 03:41:45] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:42:34] [0] 		 #> Saving chunk 72: 	 25,000 passages and 3,517,857 embeddings. From #1,800,000 onward.


73it [1:10:07, 57.06s/it]

[Feb 23, 03:42:41] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:43:31] [0] 		 #> Saving chunk 73: 	 25,000 passages and 3,501,410 embeddings. From #1,825,000 onward.


74it [1:11:04, 57.04s/it]

[Feb 23, 03:43:38] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:44:28] [0] 		 #> Saving chunk 74: 	 25,000 passages and 3,511,496 embeddings. From #1,850,000 onward.


75it [1:12:01, 56.98s/it]

[Feb 23, 03:44:35] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:45:27] [0] 		 #> Saving chunk 75: 	 25,000 passages and 3,501,942 embeddings. From #1,875,000 onward.


76it [1:13:00, 57.56s/it]

[Feb 23, 03:45:34] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:46:27] [0] 		 #> Saving chunk 76: 	 25,000 passages and 3,512,472 embeddings. From #1,900,000 onward.


77it [1:14:00, 58.19s/it]

[Feb 23, 03:46:34] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:47:25] [0] 		 #> Saving chunk 77: 	 25,000 passages and 3,484,022 embeddings. From #1,925,000 onward.


78it [1:14:57, 58.02s/it]

[Feb 23, 03:47:31] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:48:21] [0] 		 #> Saving chunk 78: 	 25,000 passages and 3,519,119 embeddings. From #1,950,000 onward.


79it [1:15:54, 57.49s/it]

[Feb 23, 03:48:28] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:49:17] [0] 		 #> Saving chunk 79: 	 25,000 passages and 3,495,361 embeddings. From #1,975,000 onward.


80it [1:16:50, 56.98s/it]

[Feb 23, 03:49:23] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:50:13] [0] 		 #> Saving chunk 80: 	 25,000 passages and 3,495,482 embeddings. From #2,000,000 onward.


81it [1:17:45, 56.67s/it]

[Feb 23, 03:50:19] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:51:10] [0] 		 #> Saving chunk 81: 	 25,000 passages and 3,510,742 embeddings. From #2,025,000 onward.


82it [1:18:42, 56.78s/it]

[Feb 23, 03:51:16] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:52:06] [0] 		 #> Saving chunk 82: 	 25,000 passages and 3,460,015 embeddings. From #2,050,000 onward.


83it [1:19:39, 56.60s/it]

[Feb 23, 03:52:13] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:53:02] [0] 		 #> Saving chunk 83: 	 25,000 passages and 3,516,439 embeddings. From #2,075,000 onward.


84it [1:20:35, 56.63s/it]

[Feb 23, 03:53:09] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:54:00] [0] 		 #> Saving chunk 84: 	 25,000 passages and 3,501,295 embeddings. From #2,100,000 onward.


85it [1:21:33, 56.92s/it]

[Feb 23, 03:54:07] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:54:57] [0] 		 #> Saving chunk 85: 	 25,000 passages and 3,503,294 embeddings. From #2,125,000 onward.


86it [1:22:30, 56.85s/it]

[Feb 23, 03:55:04] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:55:53] [0] 		 #> Saving chunk 86: 	 25,000 passages and 3,511,829 embeddings. From #2,150,000 onward.


87it [1:23:26, 56.57s/it]

[Feb 23, 03:55:59] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:56:49] [0] 		 #> Saving chunk 87: 	 25,000 passages and 3,499,856 embeddings. From #2,175,000 onward.


88it [1:24:22, 56.59s/it]

[Feb 23, 03:56:56] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:57:45] [0] 		 #> Saving chunk 88: 	 25,000 passages and 3,514,538 embeddings. From #2,200,000 onward.


89it [1:25:18, 56.40s/it]

[Feb 23, 03:57:52] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:58:41] [0] 		 #> Saving chunk 89: 	 25,000 passages and 3,493,010 embeddings. From #2,225,000 onward.


90it [1:26:14, 56.27s/it]

[Feb 23, 03:58:48] [0] 		 #> Encoding 25000 passages..
[Feb 23, 03:59:37] [0] 		 #> Saving chunk 90: 	 25,000 passages and 3,502,251 embeddings. From #2,250,000 onward.


91it [1:27:10, 56.20s/it]

[Feb 23, 03:59:44] [0] 		 #> Encoding 25000 passages..
[Feb 23, 04:00:33] [0] 		 #> Saving chunk 91: 	 25,000 passages and 3,498,252 embeddings. From #2,275,000 onward.


92it [1:28:06, 56.07s/it]

[Feb 23, 04:00:40] [0] 		 #> Encoding 25000 passages..
[Feb 23, 04:01:29] [0] 		 #> Saving chunk 92: 	 25,000 passages and 3,509,032 embeddings. From #2,300,000 onward.


93it [1:29:02, 56.01s/it]

[Feb 23, 04:01:36] [0] 		 #> Encoding 25000 passages..
[Feb 23, 04:02:25] [0] 		 #> Saving chunk 93: 	 25,000 passages and 3,501,409 embeddings. From #2,325,000 onward.


94it [1:29:58, 55.92s/it]

[Feb 23, 04:02:31] [0] 		 #> Encoding 10655 passages..
[Feb 23, 04:02:52] [0] 		 #> Saving chunk 94: 	 10,655 passages and 1,495,456 embeddings. From #2,350,000 onward.


95it [1:30:21, 46.29s/it]95it [1:30:21, 57.07s/it]

[Feb 23, 04:02:55] [0] 		 #> Checking all files were saved...
[Feb 23, 04:02:55] [0] 		 Found all files!
[Feb 23, 04:02:55] [0] 		 #> Building IVF...
[Feb 23, 04:02:56] [0] 		 #> Loading codes...



100%|██████████| 95/95 [00:00<00:00, 241.71it/s]

[Feb 23, 04:02:56] [0] 		 Sorting codes...
[Feb 23, 04:03:18] [0] 		 Getting unique codes...
[Feb 23, 04:03:19] #> Optimizing IVF to store map from centroids to list of pids..
[Feb 23, 04:03:19] #> Building the emb2pid mapping..
[Feb 23, 04:03:24] len(emb2pid) = 331048045



100%|██████████| 262144/262144 [00:36<00:00, 7259.07it/s]

[Feb 23, 04:04:02] #> Saved optimized IVF to /home/adqaicp/documents/ColBERT/docs/experiments/notebook/indexes/lifestyle.dev.2bits/ivf.pid.pt
[Feb 23, 04:04:03] [0] 		 #> Saving the indexing metadata to /home/adqaicp/documents/ColBERT/docs/experiments/notebook/indexes/lifestyle.dev.2bits/metadata.json ..





#> Joined...


In [27]:
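# Note: this call runs outside the Run().context(...) block, so the experiment name
# falls back to 'default' (see the path in the error below). With overwrite=False,
# index() refuses to replace an index that already exists at that path.
indexer = Indexer(checkpoint=checkpoint, config=config)
indexer.index(name=index_name, collection=collection, overwrite=False)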

AssertionError: /home/adqaicp/documents/ColBERT/docs/experiments/default/indexes/lifestyle.dev.2bits

In [28]:
indexer.get_index() # You can get the absolute path of the index, if needed.

'/home/adqaicp/documents/ColBERT/docs/experiments/default/indexes/lifestyle.dev.2bits'
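
The paths printed in this notebook all follow the same pattern, which the sketch below reproduces: `<root>/<experiment>/indexes/<index name>`. The helper `index_path` is hypothetical and only mirrors what the logs show. It is not a ColBERT function, and the root directory `experiments` is taken from the logged paths rather than from any documented default.

```python
import os

def index_path(root, experiment, index_name):
    """Hypothetical helper mirroring the index paths seen in the logs."""
    return os.path.join(root, experiment, "indexes", index_name)

root = "experiments"
name = "lifestyle.dev.2bits"

# Built inside Run().context(RunConfig(experiment='notebook'))
print(index_path(root, "notebook", name))

# Built outside any Run context, where the experiment name is 'default'
print(index_path(root, "default", name))
```

This is why an index built under one experiment name is not found under another: the experiment name is part of the directory.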

## Search

Having built the index, we can prepare a `Searcher` and search for individual query strings.

We can use the `queries` set we loaded earlier, or you can supply your own questions. Feel free to get creative! But keep in mind that this set of roughly 269k lifestyle passages can only answer a small, focused set of questions!

In [30]:
# To create the searcher using its relative name (i.e., not a full path), set
# experiment=value_used_for_indexing in the RunConfig.
with Run().context(RunConfig(experiment='notebook')):
    searcher = Searcher(index=index_name)


# If you want to customize the search latency-quality tradeoff, you can also supply a
# config=ColBERTConfig(ncells=.., centroid_score_threshold=.., ndocs=..) argument.
# The default settings with k <= 10 (1, 0.5, 256) give the fastest search,
# but you can get a more extensive search by setting larger values of k or
# manually specifying more conservative ColBERTConfig settings (e.g. (4, 0.4, 4096)).

[Feb 23, 04:40:08] #> Loading collection...
0M 1M 2M 
[Feb 23, 04:40:16] #> Loading codec...
[Feb 23, 04:40:16] #> Loading IVF...
[Feb 23, 04:40:16] #> Loading doclens...


100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 95/95 [00:00<00:00, 1821.68it/s]

[Feb 23, 04:40:16] #> Loading codes and residuals...



100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 95/95 [00:03<00:00, 29.06it/s]
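
The comments in the cell above mention passing explicit search settings. A minimal sketch of that is shown here, using only the two setting triples named in those comments, read in the order `(ncells, centroid_score_threshold, ndocs)`. The helper `build_searcher` and the names `FAST` and `THOROUGH` are hypothetical conveniences and not part of ColBERT.

```python
# The two setting triples named in the comments above.
FAST = dict(ncells=1, centroid_score_threshold=0.5, ndocs=256)
THOROUGH = dict(ncells=4, centroid_score_threshold=0.4, ndocs=4096)

def build_searcher(index_name, settings, experiment="notebook"):
    """Hypothetical helper: create a Searcher with explicit search settings."""
    # Imported lazily so the settings can be inspected without ColBERT installed.
    from colbert import Searcher
    from colbert.infra import Run, RunConfig, ColBERTConfig

    with Run().context(RunConfig(experiment=experiment)):
        return Searcher(index=index_name, config=ColBERTConfig(**settings))

for label, settings in [("fast", FAST), ("thorough", THOROUGH)]:
    print(label, settings)
```

With an index in place, `build_searcher(index_name, THOROUGH)` would return a searcher that probes more cells and scores more candidate passages per query, trading latency for recall.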


In [23]:
# query = queries[37]   # or supply your own query
# query = "what is applewatch?"
# query = queries[31]
# query = "hello"
# query = "learn SQL"
# query = "prepare a luggage for travel"
query = "Chinese cooking recepit"
# query = queries[3]


print(f"#> {query}")

# Find the top-10 passages for this query (k is the number of passages to return)
results = searcher.search(query, k=10)

# Print out the top-k retrieved passages
for passage_id, passage_rank, passage_score in zip(*results):
    print(f"\t [{passage_rank}] \t\t {passage_score:.1f} \t\t {searcher.collection[passage_id]}")

#> Chinese cooking recepit
	 [1] 		 18.6 		 Chinese Cooking. Cookbook.
	 [2] 		 18.6 		 Chinese Cooking. .
	 [3] 		 18.5 		 Dim Sum: Dumplings, Parcels and Other Delectable Chinese Snacks in 25 Authentic Recipes. Dim sum is a traditional style of eating, where bite-sized tidbits are served for shared dining. This book makes authentic Chinese dim sum accessible to the home cook. It opens with a practical introduction to the cuisine, with essential information on ingredients and equipment. More than 25 recipes follow, with a diverse selection of dishes from all over China..
	 [4] 		 18.5 		 The Everything Chinese Cookbook: From Wonton Soup to Sweet and Sour Chicken-300 Succelent Recipes from the Far East. Featuring hundreds of recipes, such as Snow Pea Stir-fry, Hot Chicken Salad, General Tso's Chicken, and Traditional Mu Shu Pork, The Everything Chinese Cookbookmakes preparing authentic Chinese dishes fun and easy! From basic Chinese flavors and dipping sauces, such as Quick and Easy Sw
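
The loop in the cell above relies on `searcher.search` returning three parallel lists (passage IDs, ranks, scores), which `zip(*results)` regroups into one tuple per passage. The sketch below shows that regrouping on a hand-written stand-in. The values are illustrative only and were not produced by a real search.

```python
# Illustrative stand-in for the value returned by searcher.search(query, k=3)
results = (
    [24367, 35359, 131545],  # passage IDs
    [1, 2, 3],               # ranks
    [16.1, 15.8, 15.7],      # scores
)

# zip(*results) turns three parallel lists into (pid, rank, score) tuples
rows = list(zip(*results))

for passage_id, passage_rank, passage_score in rows:
    print(f"[{passage_rank}] {passage_score:.1f} pid={passage_id}")
```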

## Batch Search

In many applications, you have a large batch of queries and you need to maximize the overall throughput. For that, you can use the `searcher.search_all(queries, k)` method, which returns a `Ranking` object that organizes the results across all queries.

(Batching provides many opportunities for higher-throughput search, though we have not implemented most of those optimizations for compressed indexes yet.)

In [16]:
rankings = searcher.search_all(queries, k=5).todict()

100%|████████████████████████████████████| 417/417 [00:01<00:00, 224.98it/s]


In [18]:
rankings[30]  # For query 30, a list of (passage_id, rank, score) for the top-k passages

[(24367, 1, 16.078125),
 (35359, 2, 15.8046875),
 (131545, 3, 15.7421875),
 (3789, 4, 15.7421875),
 (25089, 5, 15.6640625)]
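
Once the rankings are in dictionary form, ordinary Python is enough to post-process them. The sketch below reuses the entry for query 30 shown above and defines two hypothetical helpers: one that extracts the top passage IDs, and one that flattens the rankings into `qid<TAB>pid<TAB>rank<TAB>score` lines. That column order is an assumption about a convenient export layout, not something this notebook specifies.

```python
# The entry for query 30, copied from the output above
rankings = {
    30: [(24367, 1, 16.078125),
         (35359, 2, 15.8046875),
         (131545, 3, 15.7421875),
         (3789, 4, 15.7421875),
         (25089, 5, 15.6640625)],
}

def top_pids(rankings, qid, n=3):
    """Return the passage IDs of the n best-ranked passages for one query."""
    return [pid for pid, rank, score in rankings[qid][:n]]

def to_tsv_lines(rankings):
    """Flatten rankings into 'qid<TAB>pid<TAB>rank<TAB>score' lines."""
    lines = []
    for qid in sorted(rankings):
        for pid, rank, score in rankings[qid]:
            lines.append(f"{qid}\t{pid}\t{rank}\t{score}")
    return lines

print(top_pids(rankings, 30))
print("\n".join(to_tsv_lines(rankings)[:2]))
```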